首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
Tree is a data structure used to express various objects such as semistructured data and genes. When objects are represented as trees, computing tree similarity is essential for pattern recognition and retrieval. This paper considers the noisy subsequence tree recognition problem whose purpose is to recognize the original tree, given its noisy subsequence tree. Previous research on this problem relied on constrained tree edit distance to measure the dissimilarity. However, the number of relabelings must be predetermined to compute it. This paper proposes a new dissimilarity measure for this problem. Our dissimilarity measure is obtained by counting the node edit operations included in the unit‐cost tree edit distance that contribute to the matching of node labels. The number of relabelings need not be specified to compute our dissimilarity measure. Moreover, our measure achieves more accurate recognition performance and faster execution speed than the constrained tree edit distance. Our measure is also useful to solve the tree inclusion problem which is the problem of deciding whether a tree includes another tree and shows the extent of approximate tree inclusion when a tree incompletely includes another tree. © 2011 Wiley Periodicals, Inc.  相似文献   

3.
Semisupervised clustering algorithms partition a given data set using limited supervision from the user. The success of these algorithms depends on the type of supervision and also on the kind of dissimilarity measure used while creating partitions of the space. This paper proposes a clustering algorithm that uses supervision in terms of relative comparisons, viz., x is closer to y than to z. The proposed clustering algorithm simultaneously learns the underlying dissimilarity measure while finding compact clusters in the given data set using relative comparisons. Through our experimental studies on high-dimensional textual data sets, we demonstrate that the proposed algorithm achieves higher accuracy and is more robust than similar algorithms using pairwise constraints for supervision.  相似文献   

4.
This paper introduces a novel pairwise-adaptive dissimilarity measure for large high dimensional document datasets that improves the unsupervised clustering quality and speed compared to the original cosine dissimilarity measure. This measure dynamically selects a number of important features of the compared pair of document vectors. Two approaches for selecting the number of features in the application of the measure are discussed. The proposed feature selection process makes this dissimilarity measure especially applicable in large, high dimensional document collections. Its performance is validated on several test sets originating from standardized datasets. The dissimilarity measure is compared to the well-known cosine dissimilarity measure using the average F-measures of the hierarchical agglomerative clustering result. This new dissimilarity measure results in an improved clustering result obtained with a lower required computational time.  相似文献   

5.
This paper reports an experimental result obtained by additionally using unlabeled data together with labeled ones to improve the classification accuracy of dissimilarity-based methods, namely, dissimilarity-based classifications (DBC) [25]. In DBC, classifiers among classes are not based on the feature measurements of individual objects, but on a suitable dissimilarity measure among the objects instead. In order to measure the dissimilarity distance between pairwise objects, an approach using the one-shot similarity (OSS) [30] measuring technique instead of the Euclidean distance is investigated in this paper. In DBC using OSS, the unlabeled set can be used to extend the set of prototypes as well as to compute the OSS distance. The experimental results, obtained with artificial and real-life benchmark datasets, demonstrate that designing the classifiers in the OSS dissimilarity matrices instead of expanding the set of prototypes can further improve the classification accuracy in comparison with the traditional Euclidean approach. Moreover, the results demonstrate that the proposed setting does not work with non-Euclidean data.  相似文献   

6.
针对符号序列聚类中表示模型及序列间距离度量定义的困难问题,提出一种基于概率向量的表示模型及基于该模型的符号序列聚类算法。该模型引入符号序列的概率分布表示法,定义了一种基于概率分布差异的符号序列距离度量及该模型的目标函数,最后给出了一种符号序列K-均值型聚类算法,并在来自不同领域的实际应用序列集上进行了实验验证。实验结果表明,与基于子序列表示模型的符号序列聚类算法相比,所提方法在DNA序列和语音序列等具有较多符号的实际数据上,有效提高聚类精度的同时降低聚类时间50%以上。  相似文献   

7.
基于新的距离度量的K-Modes聚类算法   总被引:5,自引:1,他引:4  
传统的K-Modes聚类算法采用简单的0-1匹配差异方法来计算同一分类属性下两个属性值之间的距离, 没有充分考虑其相似性. 对此, 基于粗糙集理论, 提出了一种新的距离度量. 该距离度量在度量同一分类属性下两个属性值之间的差异时, 克服了简单0-1匹配差异法的不足, 既考虑了它们本身的异同, 又考虑了其他相关分类属性对它们的区分性. 并将提出的距离度量应用于传统K-Modes聚类算法中. 通过与基于其他距离度量的K-Modes聚类算法进行实验比较, 结果表明新的距离度量是更加有效的.  相似文献   

8.
9.
Mining and visualization of time profiled temporal associations is an important research problem that is not addressed in a wider perspective and is understudied. Visual analysis of time profiled temporal associations helps to better understand hidden seasonal, emerging, and diminishing temporal trends. The pioneering work by Yoo and Shashi Sekhar termed as SPAMINE applied the Euclidean distance measure. Following their research, subsequent studies were only restricted to the use of Euclidean distance. However, with an increase in the number of time slots, the dimensionality of a prevalence time sequence of temporal association, also increases, and this high dimensionality makes the Euclidean distance not suitable for the higher dimensions. Some of our previous studies, proposed Gaussian based dissimilarity measures and prevalence estimation approaches to discover time profiled temporal associations. To the best of our knowledge, there is no research that has addressed a similarity measure which is based on the standard score and normal probability to find the similarity between temporal patterns in z-space and retains monotonicity. Our research is pioneering work in this direction. This research has three contributions. First, we introduce a novel similarity (or dissimilarity) measure, SRIHASS to find the similarity between temporal associations. The basic idea behind the design of dissimilarity measure is to transform support values of temporal associations onto z-space and then obtain probability sequences of temporal associations using a normal distribution chart. The dissimilarity measure uses these probability sequences to estimate the similarity between patterns in z-space. The second contribution is the prevalence bound estimation approach. Finally, we give the algorithm for time profiled associating mining called Z-SPAMINE that is primarily inspired from SPAMINE. Experiment results prove that our approach, Z-SPAMINE is computationally more efficient and scalable compared to existing approaches such as Naïve, Sequential and SPAMINE that applies the Euclidean distance.  相似文献   

10.
面向混合属性数据集的改进半监督FCM聚类方法   总被引:1,自引:0,他引:1  
李晓庆  唐昊  司加胜  苗刚中 《自动化学报》2018,44(12):2259-2268
针对混合属性数据集聚类精度低的问题,本文提出一种基于改进距离度量的半监督模糊均值聚类(Fuzzy C-means,FCM)算法.首先,在数据集中针对类别属性进行预处理,并设置相应的相异度阈值;将传统聚类距离度量与改进的Jaccard距离度量结合,确定混合属性数据集的距离度量函数;最后,将所得距离度量函数与传统半监督FCM算法相结合,并在滚动轴承的不同复合故障数据的特征集中进行聚类.实验表明,该算法能在含无序属性的混合属性数据集的聚类中取得更好的聚类效果.  相似文献   

11.
The analysis and applications of adaptive-binning color histograms   总被引:1,自引:0,他引:1  
Histograms are commonly used in content-based image retrieval systems to represent the distributions of colors in images. It is a common understanding that histograms that adapt to images can represent their color distributions more efficiently than do histograms with fixed binnings. However, existing systems almost exclusively adopt fixed-binning histograms because, among existing well-known dissimilarity measures, only the computationally expensive Earth Mover’s Distance (EMD) can compare histograms with different binnings. This paper addresses the issue by defining a new dissimilarity measure that is more reliable than the Euclidean distance and yet computationally less expensive than EMD. Moreover, a mathematically sound definition of mean histogram can be defined for histogram clustering applications. Extensive test results show that adaptive histograms produce the best overall performance, in terms of good accuracy, small number of bins, no empty bin, and efficient computation, compared to existing methods for histogram retrieval, classification, and clustering tasks.  相似文献   

12.

Time profiled association mining is one of the important and challenging research problems that is relatively less addressed. Time profiled association mining has two main challenges that must be addressed. These include addressing i) dissimilarity measure that also holds monotonicity property and can efficiently prune itemset associations ii) approaches for estimating prevalence values of itemset associations over time. The pioneering research that addressed time profiled association mining is by J.S. Yoo using Euclidean distance. It is widely known fact that this distance measure suffers from high dimensionality. Given a time stamped transaction database, time profiled association mining refers to the discovery of underlying and hidden time profiled itemset associations whose true prevalence variations are similar as the user query sequence under subset constraints that include i) allowable dissimilarity value ii) a reference query time sequence iii) dissimilarity function that can find degree of similarity between a temporal itemset and reference. In this paper, we propose a novel dissimilarity measure whose design is a function of product based gaussian membership function through extending the similarity function proposed in our earlier research (G-Spamine). Our approach, MASTER (Mining of Similar Temporal Associations) which is primarily inspired from SPAMINE uses the dissimilarity measure proposed in this paper and support bound estimation approach proposed in our earlier research. Expression for computation of distance bounds of temporal patterns are designed considering the proposed measure and support estimation approach. Experiments are performed by considering naïve, sequential, Spamine and G-Spamine approaches under various test case considerations that study the scalability and computational performance of the proposed approach. Experimental results prove the scalability and efficiency of the proposed approach. The correctness and completeness of proposed approach is also proved analytically.

  相似文献   

13.
邱兴兴  程霄 《计算机应用》2013,33(9):1001-9081
针对空间分布复杂的数据以及空间分布未知的现实数据聚类问题,设计了一种改进流形距离作为不相似测度。该不相似测度可有效利用所有数据点之间的全局一致性,挖掘无类属数据集的空间分布信息。通过使用该不相似测度,提出了基于改进流形距离K-medoids算法。将新算法与基于已有的流形距离和基于欧氏距离的K-medoids算法进行性能比较,对八个人工数据集以及USPS手写体数字识别问题的实验结果表明:新算法针对不同结构的测试数据集,在聚类性能上均优于或接近于另外两种K-medoids算法,并且对于各种分布的,无论简单或复杂,凸或者非凸的数据都可以进行聚类。  相似文献   

14.
15.
16.
17.
黄华  颜恺  齐春 《自动化学报》2009,35(7):882-887
Hausdorff距离(Hausdorff distance, HD)是一种点集与点集之间的距离测度, 常用于目标物体的匹配、跟踪和识别等. 本文在分析经典HD及改进算法的基础上, 提出了一种基于相似度加权的自适应HD (Adaptive Hausdarff distance, AHD)算法. AHD算法利用不同点到点集的最小距离的个数作为匹配相似度的测量, 并舍弃对判断匹配几乎没有作用的较大的点到点集的最小距离值; 同时根据点到点集的最小距离自适应选择权值, 从而得到一种基于相似度测量加权系数; 通过利用部分点到点集的最小距离和基于相似度的加权平均, 既增强了算法的鲁棒性, 又尽可能地保证了算法的精度. 实验结果显示, AHD算法在匹配准确性、抵抗噪声和遮挡干扰等方面性能良好.  相似文献   

18.
A Distance-Based Attribute Selection Measure for Decision Tree Induction   总被引:12,自引:3,他引:9  
This note introduces a new attribute selection measure for ID3-like inductive algorithms. This measure is based on a distance between partitions such that the selected attribute in a node induces the partition which is closest to the correct partition of the subset of training examples corresponding to this node. The relationship of this measure with Quinlan's information gain is also established. It is also formally proved that our distance is not biased towards attributes with large numbers of values. Experimental studies with this distance confirm previously reported results showing that the predictive accuracy of induced decision trees is not sensitive to the goodness of the attribute selection measure. However, this distance produces smaller trees than the gain ratio measure of Quinlan, especially in the case of data whose attributes have significantly different numbers of values.  相似文献   

19.
20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号