Similar literature
1.
《Knowledge》2007,20(4):336-349
With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage them and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual properties of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

2.
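Hierarchical clustering algorithm based on tree edit distance   Total citations: 1, self-citations: 0, citations by others: 1
To identify false identities forged or altered by criminal suspects, tree edit distance is used to compute the similarity of individual attributes, and relevant mathematical properties of the tree edit distance are proved. Applying a hierarchical encoding to the attributes, the paper proposes a new hierarchical clustering algorithm based on tree edit distance, HCTED (Hierarchical Clustering Algorithm Based on Tree Edit Distance). The new algorithm computes attribute similarity at minimum cost through tree edit operations, overcoming the weakness of traditional clustering algorithms in handling nominal attributes and improving clustering accuracy; the given samples are clustered by setting a threshold. Experiments demonstrate the accuracy and effectiveness of the new method for identity recognition, and the influence of different parameters on the results is discussed. Compared with traditional clustering algorithms, HCTED performs markedly better. The new algorithm has been applied to police analysis of the floating population, with good results.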

3.
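Wang Jing. 《计算机应用研究》 (Application Research of Computers), 2020, 37(10): 2951-2955, 2960
Keywords extracted from texts of the same class take many forms and are only fuzzily related in similarity and relatedness. To address this, a text feature extraction method that clusters words hierarchically is proposed. On the premise that identical words shared between texts contribute to text similarity, the method combines word similarity and word relatedness into a semantic distance and, according to differences in that semantic distance, introduces layered clustering with a different cluster weight for each layer. The result is a vector space model with cluster weights whose feature units are both words and clusters. Word vectors trained with word2vec are used to obtain text similarity, and, following the characteristics of the Skip-Gram + Huffman Softmax model, the pointwise mutual information formula is used to obtain the relatedness between words accurately. Text classification experiments show that the proposed method improves the accuracy of text feature extraction more effectively than the commonly used approach of single-layer clustering by similarity alone followed by counting.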
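
The pointwise mutual information mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the probabilities are estimated, so the document-level co-occurrence counts, the helper name `pmi_table`, and the toy documents below are all assumptions made for illustration, not the paper's procedure.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {frozenset({x, y}): PMI(x, y)} for word pairs that co-occur.

    Assumption: probabilities are estimated as the fraction of documents
    that contain a word (or both words of a pair).
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = set(doc)  # count each word at most once per document
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

    table = {}
    for pair, n_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[pair] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical toy corpus: each document is a list of extracted keywords.
docs = [
    ["cluster", "distance"],
    ["cluster", "distance"],
    ["cluster", "tree"],
    ["text", "tree"],
]
table = pmi_table(docs)
for pair, value in sorted(table.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(value, 3))
```

A positive value means the two words co-occur more often than independence would predict, and a negative value means less often; pairs that never co-occur are simply absent from the table in this sketch.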

4.
Hierarchical clustering of mixed data based on distance hierarchy   Total citations: 1, self-citations: 0, citations by others: 1
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular, to integrate with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
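
One simple way to read the idea of a distance hierarchy is as a weighted concept tree in which the distance between two categorical values is the total link weight on the path joining them. The sketch below follows that reading only; the abstract does not give the article's exact definition, and the `PARENT` tree, its weights, and the function names are hypothetical.

```python
# Hypothetical concept hierarchy: child -> (parent, weight of the link to the parent).
PARENT = {
    "Coke": ("Carbonated", 1.0),
    "Pepsi": ("Carbonated", 1.0),
    "Orange": ("Juice", 1.0),
    "Apple": ("Juice", 1.0),
    "Carbonated": ("Drink", 1.0),
    "Juice": ("Drink", 1.0),
}


def costs_to_ancestors(value):
    """Map the value and each of its ancestors to the cumulative link weight."""
    costs = {value: 0.0}
    total = 0.0
    node = value
    while node in PARENT:
        node, weight = PARENT[node]
        total += weight
        costs[node] = total
    return costs


def hierarchy_distance(a, b):
    """Path length between a and b through their lowest common ancestor."""
    up_a = costs_to_ancestors(a)
    up_b = costs_to_ancestors(b)
    common = [node for node in up_a if node in up_b]
    if not common:
        raise ValueError(f"{a!r} and {b!r} share no ancestor")
    # With positive weights, the cheapest shared node is the lowest common ancestor.
    return min(up_a[node] + up_b[node] for node in common)


print(hierarchy_distance("Coke", "Pepsi"))   # siblings under "Carbonated"
print(hierarchy_distance("Coke", "Orange"))  # related only through "Drink"
```

Under this reading, values in the same branch are closer than values related only through the root, which a plain match/mismatch measure cannot express.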

5.
Incremental clustering of mixed data based on distance hierarchy   Total citations: 1, self-citations: 0, citations by others: 1
Clustering is an important function in data mining. Its typical applications include the analysis of consumer data. The adaptive resonance theory (ART) network is a very popular unsupervised neural network. The type I adaptive resonance theory network (ART1) deals with binary numerical data, whereas the type II network (ART2) deals with general numerical data. Several information systems collect mixed-type attributes, which include numeric attributes and categorical attributes. However, ART1 and ART2 do not deal with mixed data. If the categorical attributes are converted to a binary format, the binary data do not reflect the degree of similarity between values, which affects clustering quality. Therefore, this paper proposes a modified adaptive resonance theory network (M-ART) and a conceptual hierarchy tree to capture the degree of similarity in mixed data. Experiments use artificial simulated data and a real data set on family income. The results show that the M-ART algorithm can process mixed data and clusters it effectively.

6.
Biomedical time series clustering that automatically groups a collection of time series according to their internal similarity is of importance for medical record management and inspection such as bio-signals archiving and retrieval. In this paper, a novel framework that automatically groups a set of unlabelled multichannel biomedical time series according to their internal structural similarity is proposed. Specifically, we treat a multichannel biomedical time series as a document and extract local segments from the time series as words. We extend a topic model, i.e., the Hierarchical probabilistic Latent Semantic Analysis (H-pLSA), which was originally developed for visual motion analysis to cluster a set of unlabelled multichannel time series. The H-pLSA models each channel of the multichannel time series using a local pLSA in the first layer. The topics learned in the local pLSA are then fed to a global pLSA in the second layer to discover the categories of multichannel time series. Experiments on a dataset extracted from multichannel Electrocardiography (ECG) signals demonstrate that the proposed method performs better than previous state-of-the-art approaches and is relatively robust to the variations of parameters including length of local segments and dictionary size. Although the experimental evaluation used the multichannel ECG signals in a biometric scenario, the proposed algorithm is a universal framework for multichannel biomedical time series clustering according to their structural similarity, which has many applications in biomedical time series management.

7.
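Clustering methods for character-type data and mixed-type data are studied. First, building on classical rough set theory, the indiscernibility and compatibility conditions between objects are relaxed to obtain an extended rough set model based on a harmonious relation. Then new concepts are defined, including the degree of indiscernibility between individuals, the degree of indiscernibility between classes, and the overall approximation accuracy of a clustering result, and a new hierarchical clustering algorithm for mixed data types is proposed. The algorithm handles not only numerical data but also the character-type and mixed-type data that most clustering algorithms cannot handle. Experiments verify the feasibility of the algorithm.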

8.
9.
This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.
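
The two ways of combining numeric and nominal variables described in this abstract can be sketched as follows. The abstract does not specify the actual measures, so the Euclidean distance, the mismatch count, the `nominal_weight` parameter, and the one-hot conversion are assumptions chosen only to make the contrast concrete.

```python
import math


def combined_distance(x_num, y_num, x_nom, y_nom, nominal_weight=1.0):
    """Method 1 (sketch): separate measures, then combined into one distance."""
    numeric = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_num, y_num)))
    mismatches = sum(a != b for a, b in zip(x_nom, y_nom))
    return numeric + nominal_weight * mismatches


def one_hot(value, categories):
    """Convert one nominal value into a numeric indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]


def converted_distance(x_num, y_num, x_nom, y_nom, categories_per_attr):
    """Method 2 (sketch): convert nominal values to numbers, then one distance."""
    x, y = list(x_num), list(y_num)
    for xv, yv, categories in zip(x_nom, y_nom, categories_per_attr):
        x += one_hot(xv, categories)
        y += one_hot(yv, categories)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical records: two numeric variables and one nominal variable (colour).
colours = [["red", "green", "blue"]]
d1 = combined_distance([0.0, 0.0], [3.0, 4.0], ["red"], ["blue"])
d2 = converted_distance([0.0, 0.0], [3.0, 4.0], ["red"], ["blue"], colours)
print(d1, d2)
```

In the first function the relative influence of the nominal variables is an explicit weight, while in the second it is fixed implicitly by the encoding and by the scale of the numeric variables.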

10.
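A hybrid dynamic clustering algorithm based on partitioning and hierarchy*   Total citations: 1, self-citations: 0, citations by others: 1
To address the sensitivity of partitional clustering to initial values and the high time complexity of hierarchical clustering, a hybrid dynamic clustering algorithm based on partitioning and hierarchy, HDC-PH, is proposed. The algorithm first uses partitional clustering to quickly generate a certain number of sub-clusters, then changes the number of clusters dynamically using a clustering quality criterion based on overall similarity; a method for removing outliers during clustering is also given. Experimental results show that HDC-PH clearly outperforms the partitional and hierarchical algorithms, improves clustering quality, and yields more natural clusters.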

11.
Chameleon: hierarchical clustering using dynamic modeling   Total citations: 8, self-citations: 0, citations by others: 8
Clustering is a discovery process in data mining. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters. Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters. Another set of schemes (the Rock algorithm, group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. By considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters. Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph partitioning algorithm to cluster the data items into several relatively small subclusters. During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters.

12.
We propose a semantic clustering model based on a fuzzy inference system to discover semantic neighborhood relationships in wireless sensor networks (WSNs), in order to both reduce energy consumption and improve data accuracy. As a case study, we describe a structural health monitoring application that was used to illustrate and assess the proposed model. We conduct experiments to evaluate the proposal in two different damage scenarios with different data aggregation methods. We also compare our proposal, using the same data set, with a deterministic clustering method and with the LEACH algorithm. The results indicate that our approach is an energy-efficient clustering method for WSNs, outperforming the deterministic clustering and LEACH algorithms with energy savings of about 70% and 47%, respectively. The energy saving comes from a more efficient in-network data aggregation process: by exploiting the semantic relation between sensor nodes, we can potentially aggregate more similar data and consequently decrease data redundancy (thus minimizing transmissions). Nodes that are semantically unrelated can operate in a low duty cycle, further reducing energy consumption. Moreover, our proposal has the potential to improve the accuracy of the data provided to the application, where accuracy is a QoS requirement in typical WSN applications.

13.
Linkage methods are widely used in hierarchical clustering. In this paper, we integrate the Ordered Weighted Averaging (OWA) operator with hierarchical clustering in order to compute distances between clusters. Used this way, OWA acts as a generalization of the single linkage, complete linkage, and average linkage methods. To illustrate the proposed method, we consider a phylogenetic tree constructed by hierarchical clustering of protein sequences. To illustrate the efficiency of the method, we use a 2D data set. We obtain graphs demonstrating the relationships among the clusters, and we calculate the root-mean-square standard deviation (RMSSDT) and R-squared (RS) validity indices, which are frequently used to evaluate the results of hierarchical clustering algorithms.
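
The claim that OWA generalizes the three classical linkages can be checked with a small sketch: sort all pairwise distances between two clusters in descending order and take a weighted sum. The function name `owa_linkage`, the Euclidean helper, and the sample points are hypothetical, and how the paper chooses its weights is not stated in the abstract.

```python
from itertools import product


def owa_linkage(cluster_a, cluster_b, weights, dist):
    """OWA-aggregated distance between two clusters.

    Pairwise distances are sorted in descending order and combined with
    `weights` (non-negative, summing to 1, one weight per pair).
    """
    pair_dists = sorted(
        (dist(p, q) for p, q in product(cluster_a, cluster_b)), reverse=True
    )
    if len(weights) != len(pair_dists):
        raise ValueError("need exactly one weight per pairwise distance")
    return sum(w * d for w, d in zip(weights, pair_dists))


def euclid(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5


# Hypothetical 2D clusters.
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
n = len(a) * len(b)

complete = owa_linkage(a, b, [1.0] + [0.0] * (n - 1), euclid)  # all weight on the largest
single = owa_linkage(a, b, [0.0] * (n - 1) + [1.0], euclid)    # all weight on the smallest
average = owa_linkage(a, b, [1.0 / n] * n, euclid)             # uniform weights
print(single, average, complete)
```

Any other weight vector gives a linkage between these extremes, which is the sense in which the classical linkages are special cases.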

14.
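Traditional density-based algorithms cannot handle mixed-attribute data, and most current mixed-attribute clustering algorithms give low clustering quality. To address these problems, a mixed-attribute clustering algorithm based on density and a mixed distance measure is proposed. By analyzing the characteristics of mixed-attribute data, the algorithm divides such data into three types: numeric-dominant, categorical-dominant, and balanced. It selects a distance measure suited to the characteristics of each case, uses preset parameters to find dense regions and determine core points, and then uses the core points to identify density-connected objects, which yields the final clustering. Experiments on several data sets show that the algorithm achieves high clustering quality and handles mixed-attribute data effectively.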

15.
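Dim small target detection algorithm based on hierarchical clustering   Total citations: 3, self-citations: 0, citations by others: 3
In space images, stars, targets, and noise have similar features, and star points span a wide gray-level range, so common small-target detection methods cannot process such images effectively. A detection algorithm for dim small space targets based on hierarchical clustering is proposed. It takes the change in distance between a star point and reference stars as its basis, constructs a similarity measure from the motion characteristics of stars and targets, finds the optimal classification surface and number of classes by locating the inflection point of the sum-of-squared-errors curve, and finally separates stars, noise, and targets with a two-layer composite classification. Experimental results show that the method is compatible with 8-bit and 16-bit grayscale images and can effectively detect single-point and multi-point small targets.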

16.
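Outlier detection method based on hierarchical clustering   Total citations: 2, self-citations: 1, citations by others: 2
Outlier detection is an important step in the data mining process. An outlier detection method based on hierarchical clustering (ODHC) is proposed. ODHC analyzes the result of hierarchical clustering and examines the distance matrix for outliers in descending order of inter-cluster distance; it can detect outliers of a specified degree of deviation and continues until the user's requirement on the concentration of the data is met. The method is suited to multidimensional data sets; its principle is intuitive, it is user-friendly, and its detection accuracy is high. Simulation experiments on data sets such as iris and balloon show that ODHC identifies outliers effectively and is a simple, practical outlier detection method.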

17.
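To make a clustered network more amenable to data fusion, the properties of the minimum spanning tree (MST) are studied, and a new distributed multi-level clustering algorithm based on MST properties is justified and implemented. During clustering, each node runs the algorithm independently and uses the locally generated MST to pass and fuse connection information, completing the network clustering at the current level. After repeated fusion of connection information, a multi-level clustered network convenient for data fusion gradually forms. Experimental analysis shows that the algorithm converges quickly and consumes few resources.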

18.
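Existing outlier detection algorithms remain imperfect in generality, effectiveness, user-friendliness, and performance on large high-dimensional data sets. A fast and effective global outlier detection method based on hierarchical clustering is therefore proposed. Based on the hierarchical clustering result, the method uses the clustering tree and the distance matrix to judge the degree of isolation of the data visually and to determine the number of outliers. Outliers are then removed without supervision, working top-down through the clustering tree. Simulation experiments verify that the method identifies global outliers quickly and effectively, is user-friendly, suits data sets of different shapes, and can be used for outlier detection in large high-dimensional data sets.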

19.
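Jia Junfang. 《计算机应用》 (Journal of Computer Applications), 2011, 31(8): 2134-2137
Traditional active learning (AL) methods converge too slowly when classifying large sets of unlabeled samples. To address this, an active learning training algorithm based on hierarchical clustering (HC), the HC_AL method, is proposed. The large unlabeled data set is clustered hierarchically; at each level, the cluster centers are labeled and those labels stand in for the labels of the clusters at that level; cluster centers at that level that carry wrong labels are then added to the training set. Experiments on data sets achieved good generalization ability and fast convergence. The results show that layer-by-layer, stepwise refinement greatly increases the convergence speed of active learning while achieving satisfactory learning ability.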

20.
The nearest neighbors relation (NNR) is defined in terms of a given asymmetric matrix of similarities of data items. This paper presents a new clustering algorithm, called CLASSIC, based on an iteratively defined nested sequence of NNRs. CLASSIC has been applied to various types of gestalt clustering problems. For CLASSIC applications in which asymmetric similarities are not available a priori, this paper also introduces a method for obtaining asymmetric similarities from Euclidean distances. This method has been used in the detection of gestalt clusters by CLASSIC.
