Similar literature
 20 similar documents found; search took 31 ms
1.
A subsequence of a given string is any string obtained by deleting zero or more symbols from it. A longest common subsequence (LCS) of two strings is a common subsequence of both that is at least as long as any other common subsequence. The problem is to find an LCS of two given strings. The bound on the complexity of this problem under the decision tree model is known to be mn if the number of distinct symbols that can appear in the strings is infinite, where m and n are the lengths of the two strings, respectively, and m ≤ n. In this paper, we propose two parallel algorithms for this problem on the CREW-PRAM model. One takes O(log^2 m + log n) time with mn/log m processors, which is faster than all the existing algorithms on the same model. The other takes O(log^2 m log log m) time with mn/(log^2 m log log m) processors when log^2 m log log m > log n, or otherwise O(log n) time with mn/log n processors, which is optimal in the sense that the time × processors bound matches the complexity bound of the problem. Both algorithms exploit nice properties of the LCS problem that are discovered in this paper.
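For orientation, the sequential baseline that the mn bound refers to can be sketched as the textbook dynamic program below. This is not the parallel CREW-PRAM algorithm from the abstract; the function name `lcs` is an illustrative choice.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (sequential O(mn) DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The table has (m + 1)(n + 1) cells and each is filled in constant time, which is where the mn work of the sequential method comes from.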

2.
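ESSK: A new method for computing clickstream similarity   Cited by: 1 (self-citations: 0; citations by others: 1)
User clickstream information is widely used in Web usage mining. Clickstream similarity is commonly used for classifying and clustering user sessions. SSK (String Subsequence Kernel) was originally used to compute string similarity; it was later introduced for computing clickstream similarity and has become one of the commonly used methods. SSK builds its feature space from all subsequences of length k of the two strings. Choosing a single k often yields too few features, making it difficult to obtain a sufficiently accurate clickstream similarity. A new clickstream similarity method, ESSK (Extended String Subsequence Kernel), is therefore proposed. ESSK builds the feature space from all subsequences to address this problem with SSK. An efficient algorithm for computing ESSK is also proposed to reduce the computational complexity. Experiments show that ESSK is more accurate than SSK and more discriminative than other methods, and is therefore better suited to clickstream similarity analysis and applications.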
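The idea of using all subsequences as features can be illustrated with the standard unweighted all-subsequences kernel, which counts matching subsequence occurrences of every length with a single dynamic program. This is a sketch under that assumption, not the ESSK algorithm itself (whose weighting scheme the abstract does not give); the function names and the page identifiers are hypothetical.

```python
import math

def all_subsequences_kernel(s, t):
    """Count pairs of index selections from s and t that spell the same subsequence.

    The empty subsequence is included, so the result is always at least 1.
    """
    n, m = len(s), len(t)
    # k[i][j] is the kernel value for the prefixes s[:i] and t[:j].
    k = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over matches that skip s[i-1] or t[j-1].
            k[i][j] = k[i - 1][j] + k[i][j - 1] - k[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                # Matches that end by pairing s[i-1] with t[j-1].
                k[i][j] += k[i - 1][j - 1]
    return k[n][m]

def similarity(s, t):
    """Kernel value normalised so that a session compared with itself scores 1."""
    return all_subsequences_kernel(s, t) / math.sqrt(
        all_subsequences_kernel(s, s) * all_subsequences_kernel(t, t)
    )
```

A clickstream can be passed as a list of page identifiers, for example `similarity(["home", "cart"], ["home", "pay"])`.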

3.
Reliable detection of episodes in event sequences   Cited by: 3 (self-citations: 1; citations by others: 2)
Suppose one wants to detect bad or suspicious subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long-enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence, T, of observed events contains a suspicious subsequence pattern, S, within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: Is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e. indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, and its variance, and establishes its limiting distribution, which allows setting an alarm threshold so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with a low false alarm rate.
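The quantity being analysed, the number of windows of size w in T that contain S as a subsequence, can be computed directly by brute force. The sketch below only shows that count; it does not reproduce the paper's probability model or threshold, and the function names are illustrative.

```python
def is_subsequence(pattern, events):
    """True if pattern occurs in events in order, gaps allowed."""
    remaining = iter(events)
    # `symbol in remaining` advances the iterator, so order is enforced.
    return all(symbol in remaining for symbol in pattern)

def count_windows(pattern, events, w):
    """Number of length-w windows of events that contain pattern as a subsequence."""
    return sum(
        is_subsequence(pattern, events[start:start + w])
        for start in range(len(events) - w + 1)
    )
```

For example, `count_windows("ab", "axbab", 3)` inspects the windows `axb`, `xba`, and `bab`.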

4.

Discovering task subsequences from a continuous video stream facilitates robot imitation of sequential tasks. In this research, we develop unsupervised learning of task subsequences, which does not require a human teacher to provide supervised labels for the subsequences. A task-discriminative feature, in the form of sparsely activated cells called task capsules, is proposed for self-training to preserve the spatio-semantic information of a visual input. The task capsules are sparsely and exclusively activated with respect to the spatio-semantic context of the task subsequence: the type and location of the object. Therefore, the generalized purpose in multiple videos is discovered without supervision according to the spatio-semantic context, and the demonstration is segmented into task subsequences in an object-centric way. In comparison with existing studies on unsupervised task segmentation, our work makes the following distinct contributions: 1) a task provided as a video stream can be segmented without any pre-defined knowledge, and 2) the trained features preserve spatio-semantic information so that the segmentation is object-centric. Our experiment shows that the recognition of task subsequences can be applied to robot imitation of a sequential pick-and-place task by providing the semantic and location information of the object to be manipulated.


5.
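The m-subsequence proposed in this paper is a new sequence formed from the state-transition characteristics of an m-sequence by cross-changing the order of state transitions. Tests with the NIST randomness test software verify that m-subsequences have randomness similar to that of m-sequences. The BM algorithm shows that this pseudorandom sequence has very high linear complexity, and its complement sequence is verified to have very high linear complexity as well; m-subsequences also have a good linear complexity profile and strong resistance to linear attacks. m-subsequences are extremely numerous: for an m-sequence of period […], changing the feedback function yields at least […] m-subsequences. The feedback functions that generate m-subsequences are proven to have good algebraic immunity and fairly strong resistance to algebraic attacks. m-subsequences have good cryptographic properties and promising application prospects.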

6.
Words that appear as constrained subsequences in a text-string are considered as possible indicators of the host string structure, hence also as a possible means of sequence comparison and classification. The constraint consists of imposing a bound on the number ω of positions in the text that may intervene between any two consecutive characters of a subsequence. A subset of such ω-sequences is then characterized that consists, in intuitive terms, of sequences that could not be enriched with more characters without losing some occurrence in the text. A compact spatial representation is then proposed for these representative sequences, within which a number of parameters can be defined and measured. In the final part of the paper, such parameters are empirically analyzed on a small collection of text-strings endowed with various degrees of structure.

7.
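Research on a fast pattern discovery algorithm for time series   Cited by: 3 (self-citations: 0; citations by others: 3)
For long time series, this paper proposes a new retrieval method that quickly discovers temporal patterns in a sequence. First, the time series is divided into a number of equal-length subsequences. Next, a feature sequence is extracted from each subsequence; it reflects the trend of the data within that subsequence. Each subsequence is then assigned to one of a series of bins according to its feature sequence, so that subsequences in different bins are dissimilar because their trends differ, while subsequences in the same bin may be similar because their trends match. Finally, all patterns are discovered by computing the Euclidean distance between every pair of subsequences in each bin. Experiments show that the algorithm is effective.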
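The four steps (split, extract a trend feature, bin, compare within bins) can be sketched as follows. The abstract does not define the feature sequence, so the sign of each consecutive difference is used here as an assumed stand-in; `trend_signature`, `find_patterns`, and the `threshold` parameter are illustrative names.

```python
import math
from collections import defaultdict
from itertools import combinations

def trend_signature(window):
    """Assumed feature: +1 for a rise, -1 for a fall, 0 for no change, per step."""
    return tuple((b > a) - (b < a) for a, b in zip(window, window[1:]))

def find_patterns(series, length, threshold):
    """Return pairs of start offsets whose windows share a trend and are close."""
    # Steps 1-3: cut into equal-length windows and bin them by trend signature.
    bins = defaultdict(list)
    for start in range(0, len(series) - length + 1, length):
        window = series[start:start + length]
        bins[trend_signature(window)].append((start, window))
    # Step 4: Euclidean distance is computed only inside each bin.
    matches = []
    for members in bins.values():
        for (s1, w1), (s2, w2) in combinations(members, 2):
            if math.dist(w1, w2) <= threshold:
                matches.append((s1, s2))
    return matches
```

Binning first means distances are never computed between windows whose trends already differ, which is the source of the speed-up the abstract describes.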

8.
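Sun Tao, Zhu Xiaoming. 《计算机科学》 (Computer Science), 2017, 44(2): 270-274
The longest common subsequence of multiple sequences can represent the information they share, and it has important applications in many fields such as information retrieval and gene sequence matching. Finding the longest common subsequence of multiple sequences is a well-known NP-hard problem and is inherently a problem with multiple solutions. Some approximation algorithms have low time complexity but can produce only a single solution; for sets of sequences with multiple solutions, the result loses a great deal of information. A new approximation algorithm is therefore proposed for the longest common subsequence problem. The algorithm introduces the algebraic structure of a lattice: it uses dynamic programming to compute the common lattice of two sequences, and then recursively computes the common lattice of the current lattice and the current sequence. Paths in the common lattice store multiple common subsequences, so the algorithm ultimately yields multiple longest common subsequences. The relevant theorems are proven theoretically, and experiments verify the correctness of the algorithm.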

9.
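To address problems in time series subsequence clustering such as trivial matches and horizontal scaling, a new subsequence clustering algorithm is proposed. It uses an à trous smoothing filter bank to low-pass smooth the time series, generates trivial clusters on the resulting sequences at multiple scales, and then clusters the representative subsequences of the trivial clusters as data samples. By using trivial clusters, the new method overcomes the trivial-match problem in subsequence clustering; it can also discover similar subsequences of unequal length in a time series, which largely resolves the horizontal-axis scaling problem. Experimental results show that the new algorithm performs fairly well for subsequence clustering.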

10.
We present a new approach to motion rearrangement that preserves the syntactic structures of an input motion automatically by learning a context‐free grammar from the motion data. For grammatical analysis, we reduce an input motion into a string of terminal symbols by segmenting the motion into a series of subsequences, and then associating a group of similar subsequences with the same symbol. To obtain the most repetitive and precise set of terminals, we search for an optimal segmentation such that a large number of subsequences can be clustered into groups with little error. Once the input motion has been encoded as a string, a grammar induction algorithm is employed to build up a context‐free grammar so that the grammar can reconstruct the original string accurately as well as generate novel strings sharing their syntactic structures with the original string. Given any new strings from the learned grammar, it is straightforward to synthesize motion sequences by replacing each terminal symbol with its associated motion segment, and stitching every motion segment sequentially. We demonstrate the usefulness and flexibility of our approach by learning grammars from a large diversity of human motions, and reproducing their syntactic structures in new motion sequences.

11.
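Subsequence query techniques have important applications in finance, business, healthcare, and other fields. However, because similarity comparison algorithms such as DTW (dynamic time warping) have high time complexity, subsequence length strongly affects retrieval time, which limits the efficiency of retrieving long subsequences from a data set. A fast subsequence query algorithm is proposed to address this problem. First, all subsequences of a specific length in the data set are grouped, and a representative subsequence is marked for each group. Then, at query time, the query sequence is split into fixed-length short segments, and the DTW algorithm is used to determine the candidate set of representative subsequences similar to each short segment. Finally, the sequences in the candidate sets are concatenated to obtain the query result sequences. Experiments show that the new algorithm is about 10 times more efficient than typical algorithms.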
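The cost that motivates this work is visible in the classic DTW recurrence, sketched below with absolute difference as an assumed local cost. It shows plain DTW only, not the grouping and concatenation scheme of the abstract; `dtw_distance` is an illustrative name.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance, O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    # prev[j] is the best cost of aligning the processed prefix of a with b[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]
```

Because the work grows with the product of the two lengths, comparing short fixed-length segments against a small set of representatives is much cheaper than running DTW on long subsequences directly.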

12.
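Trend sequence analysis of temporal data and a subsequence matching algorithm   Cited by: 1 (self-citations: 0; citations by others: 1)
To address the shortcomings of traditional trend sequence analysis in temporal data mining, the concepts of numeric trend sequences and trend sequence expansion are proposed. Based on the characteristics of numeric trend sequences, the radian value corresponding to a segment's slope is used to measure the segment's trend. For the subsequence matching problem on numeric trend sequences, a "DTW double-constraint fast search algorithm" is designed. The algorithm has three parts: DTW sequential search, a double-constraint mechanism, and a redundancy elimination mechanism. DTW sequential search forms the basic framework of the algorithm, the double-constraint mechanism speeds up the computation of DTW distances, and the redundancy elimination mechanism removes redundancy from the final result set.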

13.
Subsequence matching is an operation that finds subsequences whose changing patterns are similar to a given query sequence from time-series databases. This paper identifies a performance bottleneck in subsequence matching, and then proposes an effective method that substantially improves the performance of entire subsequence matching by resolving the performance bottleneck. First, we analyze the disk access and CPU processing times required during the index searching and post-processing steps of subsequence matching through preliminary experiments. Based on these results, we show that the post-processing step is a main performance bottleneck in subsequence matching. Then, we argue that the optimization of the post-processing step is a crucial issue overlooked in previous approaches. In order to resolve the performance bottleneck, we propose a simple yet highly effective method for expediting the post-processing step. By rearranging the order of candidate subsequences to be compared with a query sequence, our method completely eliminates the redundancies of disk accesses and CPU processing that occur in the post-processing step. Our method is fairly efficient, and does not incur any false dismissal. We quantitatively demonstrate the superiority of our method through extensive experimentation. The results show that our method produces a significantly faster post-processing step: when using a data set of real-world stock sequences, our method was 43.36-96.75 times faster than previous methods, and when using data sets of large numbers of synthetic sequences, our method was 12.48-26.95 times faster than previous methods. Also, the results show that our method reduces the weight of the post-processing step over entire subsequence matching from more than 97% to less than 67%. This implies that our method successfully resolves the performance bottleneck in subsequence matching. As a result, our method provides excellent performance in entire subsequence matching. Compared with previous methods, our method is 16.17-32.64 times faster when using a data set of real-world stock sequences and 8.64-14.29 times faster when using data sets of large numbers of synthetic sequences.

14.
The search for similar subsequences is a core module for various analytical tasks in sequence databases. Typically, the similarity computations require users to set a length. However, there is no robust means by which to define the proper length for different application needs. In this study, we examine a new query that is capable of returning the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful to analyses without prior knowledge regarding the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar lengths and offset by intraobject and interobject grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric of arbitrary length subsequences. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graphics processing unit-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.

15.
Partitioning a sequence into few monotone subsequences   Cited by: 1 (self-citations: 0; citations by others: 1)
In this paper we consider the problem of finding sets of long disjoint monotone subsequences of a sequence of numbers. We give an algorithm that, after […] preprocessing time, finds and deletes an increasing subsequence of size […] (if it exists) in time […]. Using this algorithm, it is possible to partition a sequence of […] numbers into […] monotone subsequences in time […]. Our algorithm yields improvements for two applications. The first is constructing good splitters for a set of lines in the plane. Good splitters are useful for two-dimensional simplex range searching. The second application is in VLSI, where we seek a partitioning of a given graph into subsets, commonly referred to as the pages of a book, where all the vertices can be placed on the spine of the book, and each subgraph is planar. Received: 23 July 1990 / 19 June 1997

16.
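The support vector machine (SVM) is a relatively new machine learning method. It satisfies the requirement of structural risk minimization and is applicable to high-dimensional feature spaces, so it has been widely used in biological sequence analysis. Taking the characteristics of gene sequences into account, a new kernel function is proposed: the position-weighted subsequence kernel. This kernel combines the compositional features and the positional information of subsequences within gene sequences, and it reflects sequence features fairly fully. The kernel is applied to the recognition of gene splice sites; the results show that an SVM with the position-weighted subsequence kernel recognizes splice sites well and achieves higher recognition accuracy than other methods.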

17.
A new approach to task segmentation in a mobile robot by a modular network SOM (mnSOM) is proposed. In a mobile robot, the standard mnSOM is not applicable as it is, because it is based on the assumption that class labels are known a priori. In a mobile robot, only a sequence of data without segmentation is available. Hence, we propose to decompose it into many subsequences, supposing that a class label does not change within a subsequence. Accordingly, training of mnSOM is done for each subsequence, in contrast to that for each class in the standard mnSOM. The resulting mnSOM demonstrates good segmentation performance of 94.05% for a novel dataset.

18.
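Mining frequent subsequences in time series based on symbolic representation   Cited by: 1 (self-citations: 0; citations by others: 1)
A new algorithm is proposed for mining frequent subsequences in time series based on symbolic representation. A PAA-based piecewise linear representation is used for dimensionality reduction, breakpoints are set under a Gaussian distribution to obtain a symbolic representation of the time series, and a projected database is used to mine frequent subsequences. The algorithm is simple, novel, and fast, and it simplifies the computation of subsequence support counts.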
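The symbolization step (PAA for dimensionality reduction, then breakpoints under a Gaussian distribution) can be sketched in the style of SAX. Treating it as SAX-like is an assumption, as are the z-normalization, the equiprobable breakpoints, and the names `paa` and `symbolize`; the projected-database mining step is not shown.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def paa(values, segments):
    """Mean of each equal-size chunk; len(values) is assumed divisible by segments."""
    size = len(values) // segments
    return [mean(values[i * size:(i + 1) * size]) for i in range(segments)]

def symbolize(values, segments, alphabet="abcd"):
    """Map a series to one letter per PAA segment using Gaussian breakpoints."""
    mu, sigma = mean(values), pstdev(values)  # sigma is assumed to be nonzero
    normalized = [(v - mu) / sigma for v in values]
    k = len(alphabet)
    # Breakpoints split the standard normal distribution into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(
        alphabet[bisect_right(breakpoints, x)] for x in paa(normalized, segments)
    )
```

Once each series is a short string over a small alphabet, frequent subsequences can be counted with ordinary string-mining techniques.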

19.
A time-series database is a set of data sequences, each of which is a list of changing values of an object in a given period of time. Subsequence matching is an operation that searches for such data subsequences whose changing patterns are similar to a query sequence in a time-series database. This paper addresses a performance issue of time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect, and then show that the performance of subsequence matching with a single index is not satisfactory in real applications. We claim that index interpolation is a fairly effective tool to solve this problem. Index interpolation performs subsequence matching by selecting the most appropriate one from multiple indexes built on windows of distinct sizes. For index interpolation, we need to decide the sizes of windows for multiple indexes to be built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. Given a set of pairs 〈length, frequency〉 of query sequences to be performed in a target application and a set of window sizes for building multiple indexes, we devise a formula that estimates the overall cost of all the subsequence matchings performed in a target application. By using this formula, we propose an algorithm that determines the optimal window sizes for maximizing the performance of entire subsequence matchings. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we show the superiority of our approach by performing extensive experiments with a real-life stock data set and a large volume of synthetic data sets.

20.
Anomaly detection has received much attention due to its various applications. Generally, the first step in discovering anomalies is a data representation method that reduces dimensionality while preserving key information. Anomaly detection based on real-value representation methods is meaningful for its convenience in numeric operations. A typical real-value representation method is the Piecewise Aggregate Approximation (PAA), which is simple and intuitive: it captures the mean values of segments in a sequence. However, if segments have the same or similar average values but different oscillation amplitudes, the PAA method cannot effectively describe a sequence composed of such segments. To address this issue, we propose a representation method called the Piecewise Aggregate Approximation in the Amplitude Domain (AD-PAA). To discover anomalies, a sequence is first partitioned into subsequences by a sliding window. Then, in the AD-PAA method, a subsequence is divided into equal-size subsections according to the amplitude domain. With the mean values of the subsections computed, the amplitude oscillation of a subsequence is embodied effectively. When the AD-PAA method is applied to approximate subsequences, the AD-PAA representation of a sequence is constructed. Anomalies are determined by anomaly scores that are based on similarities among representation results. Experimental results on various data confirm that the proposed method is more accurate than the PAA-based method and other comparison methods. The proposed algorithm is also better at differentiating anomalies.
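The weakness of plain PAA described here is easy to reproduce: two segments with equal means but very different oscillation collapse to the same value. The sketch below shows that effect; `segment_ranges` is only a hypothetical diagnostic for the lost amplitude information and is not the AD-PAA representation.

```python
def paa(values, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-size segment."""
    size = len(values) // segments  # len(values) is assumed divisible by segments
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(segments)]

def segment_ranges(values, segments):
    """Peak-to-peak amplitude of each segment, the information PAA discards."""
    size = len(values) // segments
    chunks = (values[i * size:(i + 1) * size] for i in range(segments))
    return [max(chunk) - min(chunk) for chunk in chunks]

calm = [5, 5, 5, 5]   # flat segment with mean 5
wild = [1, 9, 1, 9]   # oscillating segment, also with mean 5
print(paa(calm + wild, 2))             # both segments get the same mean
print(segment_ranges(calm + wild, 2))  # the amplitudes differ
```

A representation that partitions by amplitude rather than by time, as the abstract proposes, is one way to keep the two segments distinguishable.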
