Similar Documents
20 similar documents found
1.
张豪  陈黎飞  郭躬德 《计算机科学》2015,42(5):114-118, 141
Symbolic sequences consist of a finite set of symbols arranged in a certain order, and are widespread in many application areas of data mining, such as gene sequences, protein sequences, and speech sequences. As a major approach to sequence mining, sequence clustering has important applications in identifying the inherent structure of sequence data; at the same time, because similarity between symbolic sequences is difficult to measure, sequence clustering remains an open problem. This paper first proposes a new similarity measure for symbolic sequences, introducing a length-normalization factor to address the sensitivity of existing measures to sequence length, thereby improving the effectiveness of symbolic-sequence similarity measurement. On this basis, a new clustering method is proposed: an acyclic connected graph is built from the pairwise sample similarities, and hierarchical clustering of the symbolic sequences is performed by graph partitioning. Experimental results on several real-world datasets show that the new method with the normalized measure effectively improves the clustering accuracy of symbolic sequences.
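The abstract does not give the exact formula, but a length-normalized similarity of the kind described might be sketched as follows; the n-gram profile and the max-length normalization factor are illustrative assumptions, not the paper's definition:

```python
from collections import Counter

def ngram_profile(seq, n=2):
    """Count the n-grams occurring in a symbol sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def normalized_similarity(a, b, n=2):
    """Similarity of two symbol sequences with a length-normalization
    factor, so that longer sequences do not dominate the score."""
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    shared = sum((pa & pb).values())       # common n-gram occurrences
    norm = max(len(a), len(b)) - n + 1     # hypothetical length-normalization factor
    return shared / norm if norm > 0 else 0.0
```

Without the normalization term, a short sequence sharing all its n-grams with a long one would score the same as two identical long sequences.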

2.
Database applications often impose temporal dependencies between transactions that must be satisfied to preserve data consistency. The extant correctness criteria used to schedule the execution of concurrent transactions are either time independent or use strict, difficult-to-satisfy real-time constraints. On one end of the spectrum, serializability completely ignores time. On the other end, deadline scheduling approaches consider the outcome of each transaction execution correct only if the transaction meets its real-time deadline. In this article, we explore new correctness criteria and scheduling methods that capture temporal transaction dependencies and belong to the broad area between these two extreme approaches. We introduce the concepts of succession dependency and chronological dependency and define correctness criteria under which temporal dependencies between transactions are preserved even if the dependent transactions execute concurrently. We also propose a chronological scheduler that can guarantee that transaction executions satisfy their chronological constraints. The advantages of chronological scheduling over traditional scheduling methods, as well as the main issues in the implementation and performance of the proposed scheduler, are discussed.

3.
Hierarchical clustering of mixed data based on distance hierarchy
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measurement for numerical and categorical values. We then apply the scheme to mixed data clustering, in particular, by integrating it with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical and categorical data, and also takes the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
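A distance hierarchy can be pictured as a concept tree in which each categorical value is a leaf and the distance between two values is the weighted path length through their lowest common ancestor. A minimal sketch, with a hypothetical "drink" hierarchy and unit link weights (both are illustrative assumptions, not the paper's data):

```python
# Hypothetical concept hierarchy: Coke and Pepsi share the deeper
# ancestor Cola, so they are closer than Coke and Tea, which only
# meet at the root Drink.
parent = {"Coke": "Cola", "Pepsi": "Cola", "Cola": "Drink",
          "Tea": "Drink", "Drink": None}

def path_to_root(value):
    """List the nodes from a value up to the root of the hierarchy."""
    path = []
    while value is not None:
        path.append(value)
        value = parent[value]
    return path

def hierarchy_distance(a, b, link_weight=1.0):
    """Distance between two categorical values: total weight of the
    links from each value up to their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    hops_a = min(pa.index(x) for x in common)  # hops from a to the LCA
    hops_b = min(pb.index(x) for x in common)  # hops from b to the LCA
    return link_weight * (hops_a + hops_b)
```

With this scheme the numeric subtraction distance and the categorical hierarchy distance can share one codomain, which is what allows the uniform treatment of mixed data.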

4.
Clustering is one of the most popular techniques in data mining. The goal of clustering is to identify distinct groups in a dataset. Many clustering algorithms have been published so far, but most are limited to numeric or categorical data, whereas most real-world data are mixed. In this paper, we propose a clustering algorithm, CAVE, which is based on variance and entropy and is capable of mining mixed data. Variance is used to measure the similarity of the numeric part of the data. To express the similarity between categorical values, a distance hierarchy is employed; the similarity of the categorical part is then measured by entropy weighted by the distances in the hierarchies. A new validity index for evaluating clustering results is also proposed. The effectiveness of CAVE is demonstrated by a series of experiments on synthetic and real datasets in comparison with several traditional clustering algorithms. An application of mining a mixed dataset for customer segmentation and catalog marketing is also presented.
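The two ingredients CAVE combines are standard quantities; a minimal sketch of within-cluster variance for numeric attributes and Shannon entropy for categorical ones (the entropy weighting by hierarchy distances is omitted here):

```python
import math
from collections import Counter

def variance(values):
    """Variance of a numeric attribute within a cluster; lower
    variance means the cluster is tighter on that attribute."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def entropy(values):
    """Shannon entropy of a categorical attribute within a cluster;
    lower entropy means the cluster is purer on that attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```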

5.
Various data mining methods have been developed over the last few years for the study of hepatitis, using a large temporal and relational database made available to the research community. In this work we introduce a novel temporal abstraction method to this study by detecting and exploiting temporal patterns and relations between events in viral hepatitis, such as "event A happened slightly before event B, and B ended simultaneously with event C". We developed algorithms that first detect significant temporal patterns in temporal sequences and then identify the temporal relations between these patterns. Many findings obtained by applying data mining methods to transactions/graphs of temporal relations were shown to be significant by physician evaluation and by matching against results published in Medline.
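Qualitative relations such as "A happened slightly before B" or "B ended simultaneously with C" can be read as endpoint comparisons with a tolerance. A hedged sketch, where the tolerance `eps` and the relation labels are illustrative assumptions rather than the paper's definitions:

```python
def relate(a, b, eps=1.0):
    """Classify the qualitative temporal relation between two events,
    each given as a (start, end) pair; near-coincident endpoints
    (within eps) count as simultaneous."""
    a_start, a_end = a
    b_start, b_end = b
    rels = []
    if abs(a_start - b_start) <= eps:
        rels.append("start together")
    elif a_start < b_start:
        rels.append("A starts before B")
    else:
        rels.append("B starts before A")
    if abs(a_end - b_end) <= eps:
        rels.append("end together")
    elif a_end < b_end:
        rels.append("A ends before B")
    else:
        rels.append("B ends before A")
    return rels
```

Pairwise relations produced this way can then be assembled into the transactions/graphs of temporal relations mentioned in the abstract.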

6.
A structure matching method based on functional dependencies
李国徽  杜小坤  胡方晓  杨兵  唐向红 《软件学报》2009,20(10):2667-2678
Schema matching is a fundamental problem in schema integration, data warehousing, e-commerce, and semantic query processing, and has recently become a research hotspot with fruitful results. These results mainly mine element semantics from the information carried by the elements themselves (typically the attributes of a relational schema), and research in this direction is by now quite mature. Structural information, another important kind of information in a schema, can usefully support more accurate schema matching, but little work has studied how to exploit it. This paper divides the similarity between schema elements into semantic similarity (derived from the elements' own information) and structural similarity (derived from the relationships between elements), computes the structural similarity between elements with a new statistical method, and then combines it with the semantic similarity to obtain the probability that two elements are similar; finally, the mappings between schema elements (their correspondences) are derived from these similarity probabilities. Experimental results show that the algorithm outperforms existing algorithms in precision, recall, and completeness.

7.
A phenomenal growth in the number of credit card transactions, especially for online purchases, has recently led to a substantial rise in fraudulent activities. Implementation of efficient fraud detection systems has thus become imperative for all credit card issuing banks to minimize their losses. In real life, fraudulent transactions are interspersed with genuine transactions, and simple pattern matching is often not sufficient to detect them accurately. Thus, there is a need to combine both anomaly detection and misuse detection techniques. In this paper, we propose to use two-stage sequence alignment, in which a profile analyzer (PA) first determines the similarity of an incoming sequence of transactions on a given credit card with the genuine cardholder's past spending sequences. The unusual transactions traced by the profile analyzer are then passed on to a deviation analyzer (DA) for possible alignment with past fraudulent behavior. The final decision about the nature of a transaction is taken on the basis of the observations by these two analyzers. In order to achieve online response time for both PA and DA, we suggest a new approach for combining two sequence alignment algorithms, BLAST and SSAHA.
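The BLAST/SSAHA combination itself is beyond a short sketch; as a stand-in, a minimal Smith-Waterman-style local alignment between an incoming transaction sequence and a stored spending sequence illustrates the kind of similarity score a profile analyzer could use (the scoring parameters are illustrative assumptions):

```python
def alignment_score(seq, profile, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between an incoming transaction
    sequence and a stored spending sequence, computed with the
    standard dynamic-programming recurrence over two rows."""
    cols = len(profile) + 1
    best = 0
    prev = [0] * cols
    for i in range(1, len(seq) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if seq[i - 1] == profile[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

A low score against all of the cardholder's past sequences would mark the incoming sequence as unusual and hand it to the deviation analyzer.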

8.
Observing that clinical sequences for a single disease mostly share similar temporal frequent patterns, this paper proposes a Markov-model-based detection model for clinical sequences. An edit-distance algorithm is used to transform the clinical sequence data and construct their feature space; a Markov model is built from the temporal order of the frequent patterns in the clinical sequences, and its parameters are estimated; the parameters and the sequence under test are fed into the detection model to compute the support probability of frequent-pattern transitions; and the deviation of the result from a given threshold determines whether the clinical behavior is anomalous. Experimental results show that, with suitable parameter values, anomalous clinical behavior can be detected effectively.
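A minimal sketch of the Markov step described above: estimate transition probabilities between (already extracted) frequent patterns from normal sequences, then flag a sequence whose average transition-support probability falls below a threshold. The symbols, the averaging rule, and the threshold are illustrative assumptions:

```python
from collections import defaultdict

def train_markov(sequences):
    """Estimate first-order transition probabilities between the
    frequent patterns observed in normal clinical sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

def support_probability(model, seq):
    """Average transition-support probability of a sequence under the
    model; transitions never seen in training contribute zero."""
    steps = list(zip(seq, seq[1:]))
    if not steps:
        return 0.0
    return sum(model.get(a, {}).get(b, 0.0) for a, b in steps) / len(steps)

def is_anomalous(model, seq, threshold=0.5):
    """Flag clinical behavior whose support falls below the threshold."""
    return support_probability(model, seq) < threshold
```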

9.
Sequential pattern mining discovers the sequential purchasing behaviours of most customers from a large number of customer transactions. The strategy focuses on discovering frequent sequences, where a frequent sequence is an ordered list of the itemsets purchased by a sufficient number of customers. Previous approaches for mining sequential patterns need to repeatedly scan the database, so they spend a large amount of computation time finding frequent sequences. Customer transactions grow rapidly in a short time, and some may become antiquated; consequently, the frequent sequences may change as new customer transactions are inserted into, or old ones deleted from, the database, which may require rediscovering all the patterns by scanning the entire updated database. In this paper, we propose an incremental updating technique to maintain the discovered sequential patterns when transactions are inserted into or deleted from the database. Our approach partitions the database into segments and scans it segment by segment. For each segment scan, our approach prunes those sequences that can no longer be frequent, accelerating the discovery of the frequent sequences. The number of database scans can therefore be significantly reduced. The experimental results show that our algorithms are more efficient than other algorithms for the maintenance of mined sequential patterns.

10.
A variety of clustering algorithms exists to group objects having similar characteristics, but many of them are hard to apply to categorical data: some cannot handle categorical data at all, while others cannot handle the uncertainty inherent in it. Both abilities are prerequisites for clustering categorical data. The minimum-minimum roughness (MMR) algorithm was proposed to deal with these problems using rough set theory, and many later algorithms have been developed to improve the handling of hybrid data. This research proposes information-theoretic dependency roughness (ITDR), another technique for categorical data clustering, which takes into account the information-theoretic dependency degree between attributes of categorical-valued information systems and improves on all of its predecessors: MMR, MMeR, SDR, and standard-deviation of standard-deviation roughness (SSDR). Experimental results on two benchmark UCI datasets show that ITDR outperforms the baseline categorical data clustering techniques with respect to computational complexity and cluster purity.

11.
Clustering, one of the most important techniques in data mining, groups data according to a similarity principle; clustering categorical data, however, is an important yet difficult problem for learning algorithms. The traditional k-modes algorithm defines the dissimilarity between two attribute values by simple 0-1 matching, ignoring the distribution of the whole dataset, which makes the dissimilarity measure insufficiently accurate. To address this, a k-modes algorithm based on structural similarity is proposed. The algorithm considers not only whether two attribute values are identical, but also the structure they occupy under the other attributes. Simulation experiments on cluster identification and accuracy show that the structural-similarity-based k-modes algorithm is more effective in both scalability and accuracy.
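The contrast the abstract draws can be illustrated as follows: the classical 0-1 matching of k-modes against a structure-aware dissimilarity that compares the conditional distributions two values induce on another attribute. The specific structural measure below is an illustrative assumption, not the paper's definition:

```python
def simple_matching(x, y):
    """Classical k-modes dissimilarity: count the attributes on which
    two objects disagree, ignoring the data distribution."""
    return sum(int(a != b) for a, b in zip(x, y))

def structural_dissimilarity(x, y, data, j, k):
    """Illustrative structure-aware dissimilarity for attribute j:
    compare the conditional distributions that the two values of j
    induce on another attribute k, rather than a plain 0/1 match."""
    def conditional(value):
        rows = [r for r in data if r[j] == value]
        vals = {r[k] for r in data}
        return {v: sum(r[k] == v for r in rows) / max(len(rows), 1) for v in vals}
    pa, pb = conditional(x[j]), conditional(y[j])
    # Halved L1 distance between the induced distributions, in [0, 1].
    return 0.5 * sum(abs(pa[v] - pb[v]) for v in pa)
```

Two distinct values that behave identically under the other attributes then get dissimilarity 0, whereas 0-1 matching would still count them as fully different.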

12.
Because symbolic data lack a clear spatial structure, it is hard to construct a reasonable similarity measure, which prevents many numerical clustering algorithms from being extended to symbolic data. This paper introduces a space-structure representation that converts symbolic data into numerical data, reconstructing the similarity between samples while preserving the structural characteristics of the original symbolic data. On this basis, the affinity propagation (AP) clustering algorithm is transferred to symbolic data clustering, yielding a space-structure-based AP algorithm for symbolic data (SBAP). Experiments on several symbolic datasets from the UCI repository show that SBAP enables the AP algorithm to handle symbolic data clustering effectively and improves its performance.

13.
常伟鹏  袁泉 《计算机仿真》2021,38(1):331-335
Matching and associating network information entities enables better transfer and analysis of network data. Because network data are multi-source, heterogeneous, and non-uniformly distributed, it is difficult to match information entities accurately and quickly. This paper therefore proposes an association strategy for network information entities that fuses multiple matching modes. Taking the complexity and dynamics of network information entities into account, the strategy first designs a syntactic similarity for fast matching of large numbers of simple entities; then a semantic similarity, based on depth and distance, for accurate matching of the stems and compound terms contained in entities; next a type similarity, built from data types, for matching entities with missing information; and finally a structural similarity, based on edit distance and a penalty function, for matching the contextual dependencies and constraints between entities. Experimental results verify that the strategy discriminates sensitively, achieves significant gains in both matching accuracy and matching efficiency, and effectively handles the heterogeneity and distribution of network information entities.

14.
Cluster analysis, or clustering, refers to the analysis of the structural organization of a data set. This analysis is performed by grouping together objects of the data that are more similar among themselves than to objects of different groups. The sampled data may be described by numerical features or by a symbolic representation, known as categorical features. These features often require a transformation into numerical data in order to be properly handled by clustering algorithms. The transformation usually assigns a weight to each feature, calculated by a measure of importance (e.g., frequency or mutual information). A problem with this weight assignment is that the values are calculated with respect to the whole set of objects and features, which is problematic when a subset of the features has a higher degree of importance for one subset of objects but a lower degree for another. One way to deal with this is to measure the importance of each subset of features only with respect to a subset of objects. This is known as co-clustering: similarly to clustering, it is the task of finding subsets of objects and features that present a higher similarity among themselves than to other subsets of objects and features. This task has a higher complexity than traditional clustering and, if not properly dealt with, may present a scalability issue. In this paper we propose a novel co-clustering technique, called HBLCoClust, with the objective of extracting a set of co-clusters from a categorical data set, without the guarantees of an enumerative algorithm but with the benefit of scalability. This is done by using a probabilistic algorithm, Locality Sensitive Hashing, together with the enumerative algorithm InClose. The experimental results are competitive when applied to labeled categorical data sets and text corpora. Additionally, it is shown that the extracted co-clusters can be of practical use to expert systems such as recommender systems and topic extraction.

15.
李鸣  张鸿 《计算机应用》2016,36(10):2822-2825
Content-based image retrieval has long faced the "semantic gap" problem: feature selection directly affects the outcome of semantic learning, while traditional distance metrics compute similarity from a single perspective and cannot represent the similarity between images well. To address these problems, a bilinear image-similarity matching method based on deep feature analysis is proposed. First, a convolutional neural network model is fine-tuned on the image dataset; the trained network is then used to extract features from the images. Given the features output by the fully connected layer, the similarity between images is computed with a bilinear similarity measure, and the most similar image instances are returned by ranking the similarities. Comparative experiments on the Caltech101 and Caltech256 datasets show that the proposed algorithm outperforms the baselines in mean precision, Top-K precision, and recall, verifying its effectiveness.
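The bilinear measure itself is simple: s(x, y) = xᵀWy for feature vectors x, y and a weight matrix W. A minimal pure-Python sketch; the ranking helper is an illustrative addition, and in the paper W would be learned rather than fixed (with W the identity, the score reduces to a plain inner product):

```python
def bilinear_similarity(x, y, W):
    """Bilinear similarity s(x, y) = x^T W y between two feature vectors."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def rank_by_similarity(query, gallery, W):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [bilinear_similarity(query, g, W) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -scores[i])
```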

16.
王丰  王亚沙  赵俊峰  崔达 《软件学报》2019,30(5):1510-1521
With the rapid development of the Semantic Web, knowledge models expressed as ontologies have appeared in many domains, but practical Semantic Web applications often suffer from a shortage of ontology instances. Converting the data in existing relational data sources into ontology instances is an effective remedy, which requires schema matching between the relational model and the ontology model to establish mappings between the data source and the ontology. Beyond this, relational-to-ontology schema matching is also widely used in data integration, semantic data annotation, and ontology-based data access. Existing work typically combines multiple schema matching algorithms to compute an aggregate similarity for element pairs across heterogeneous schemas, assisting humans in building mappings from data sources to ontologies. Such work tries to compensate for the low accuracy of any single matching algorithm by combining the results of several; however, when several algorithms are inaccurate at the same time, this approach can hardly produce a more accurate final match. This paper analyzes the causes of inaccuracy in single schema matching algorithms in depth, argues that localized characteristics of the data source are an important factor, and proposes an iteratively optimized schema matching scheme. The scheme uses element pairs already matched during the matching process to optimize the single matching algorithms; the optimized algorithms accommodate the localized characteristics of the data source better and thus improve accuracy significantly. In an experiment on a real case in the catering-information-management domain, the matching quality is markedly higher than that of traditional methods, with an F-measure exceeding the traditional methods by 50.1%.

17.
Biomedical waveforms, such as the electrocardiogram (ECG) and arterial pulse, carry a great deal of important clinical information and are usually recorded over long periods of time in telemedicine applications. Due to the huge amount of data, compressing biomedical waveform data is vital. By exploiting the strong similarity and correlation between successive beat patterns in biomedical waveform sequences, this paper introduces an efficient data compression scheme based mainly on pattern matching. The waveform codec consists of four main units: beat segmentation, beat normalization, two-stage pattern matching with template updating, and residual beat coding. Three residual beat coding methods are employed: Huffman/run-length coding, Huffman/run-length coding in the discrete cosine transform domain, and vector quantization. Simulation results show that our compression algorithms achieve a very significant improvement in compression ratio and error measurement for both ECG and pulse signals, as compared with other compression methods.

18.
The security of computers and their networks is of crucial concern in the world today. One mechanism to safeguard information stored in database systems is an Intrusion Detection System (IDS). The purpose of intrusion detection in database systems is to detect malicious transactions that corrupt data. Recently, researchers have been working on using data mining techniques to detect such malicious transactions. These approaches concentrate on mining data dependencies among data items; transactions not compliant with these data dependencies are identified as malicious. However, the algorithms these approaches use for designing their data dependency miners have limitations: for instance, they need to experimentally determine appropriate settings for minimum support and related constraints, which does not necessarily lead to strong data dependencies. In this paper we propose a new data mining algorithm, called Optimal Data Access Dependency Rule Mining (ODADRM), for designing a data dependency miner for our database IDS. ODADRM is an extension of the k-optimal rule discovery algorithm, adapted to the database intrusion detection domain. ODADRM avoids many limitations of previous data dependency miner algorithms and, as a result, is able to track normal transactions and detect malicious ones more effectively than existing approaches.

19.
Multidimensional data sets often include categorical information. When most dimensions have categorical information, clustering the data set as a whole can reveal interesting patterns in the data set. However, the categorical information is often more useful as a way to partition the data set: gene expression data for healthy versus diseased samples or stock performance for common, preferred, or convertible shares. We present novel ways to utilize categorical information in exploratory data analysis by enhancing the rank-by-feature framework. First, we present ranking criteria for categorical variables and ways to improve the score overview. Second, we present a novel way to utilize the categorical information together with clustering algorithms. Users can partition the data set according to categorical information vertically or horizontally, and the clustering result for each partition can serve as new categorical information. We report the results of a longitudinal case study with a biomedical research team, including insights gained and potential future work.

20.
于亚君  姜瑛 《计算机工程与应用》2012,48(20):177-181,210
XML-tree-based matching has been widely applied in data mining, natural language processing, image retrieval, and other fields. Analysis of existing XML-tree matching-degree computations reveals problems such as overly strict preconditions for the computation (e.g., weight partitioning) and errors in the resulting matching degrees, which hurt matching precision and efficiency. Based on the content and structure constraints of XML, and combining node similarity with level similarity, this paper proposes a structural similarity formula that improves the accuracy of the computed matches; the effectiveness of the formula is verified experimentally.
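One simple way to combine node similarity with level similarity is to score the overlap of element tags level by level and average over levels; this is an illustrative reading of the abstract, not the paper's formula:

```python
import xml.etree.ElementTree as ET

def labels_by_level(root, level=0, acc=None):
    """Collect element tags grouped by their depth in the XML tree."""
    acc = {} if acc is None else acc
    acc.setdefault(level, []).append(root.tag)
    for child in root:
        labels_by_level(child, level + 1, acc)
    return acc

def structural_similarity(xml_a, xml_b):
    """Per-level Jaccard overlap of element tags, averaged over the
    levels present in either tree, so that matching tags only count
    when they occur at the same depth."""
    la = labels_by_level(ET.fromstring(xml_a))
    lb = labels_by_level(ET.fromstring(xml_b))
    levels = set(la) | set(lb)
    score = 0.0
    for lvl in levels:
        a, b = set(la.get(lvl, [])), set(lb.get(lvl, []))
        if a | b:
            score += len(a & b) / len(a | b)
    return score / len(levels)
```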
