Similar literature
 Found 20 similar documents (search time: 234 ms)
1.
This paper presents an efficient K-medians clustering (unsupervised) algorithm for prototype selection and a Supervised K-medians (SKM) classification technique for protein sequences. For sequence data sets, a median string/sequence can serve as the cluster/group representative. In the K-medians clustering technique, a desired number of clusters, K, is generated, each represented by a median string/sequence, and these median sequences are used as prototypes for classifying a new/test sequence. In the SKM classification technique, the median sequence of each group/class of labelled protein sequences is determined, and the set of median sequences is used as prototypes for classification. For the data sets used, the K-medians clustering technique outperforms the leader-based technique, and the SKM classification technique performs better than the motif-based approach. We further use a simple technique to reduce time and space requirements during protein sequence clustering and classification: during the training and testing phases, the similarity score between a pair of sequences is computed on a selected portion of each sequence instead of the entire sequence, which is akin to selecting a subset of features for sequence data sets. Experimental results with the K-medians, SKM and Nearest Neighbour Classifier (NNC) techniques show that the Classification Accuracy (CA) obtained with the prototypes generated/used does not degrade much, while training and testing times are reduced significantly. The results thus indicate that a good CA can be achieved without computing the similarity score over the entire length of the sequence. The space requirement is also reduced during both training and classification.
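The median-prototype idea in this abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: Levenshtein edit distance stands in for the paper's similarity score, the "portion" of a sequence is taken to be a prefix, and the names `edit_distance`, `set_median` and `classify` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (assumed stand-in for the paper's similarity score)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def _cutter(portion):
    """Compare only the first `portion` symbols (assumption: the portion is a prefix)."""
    return (lambda s: s[:portion]) if portion else (lambda s: s)


def set_median(seqs, portion=None):
    """Return the member of `seqs` with the smallest total distance to all members."""
    cut = _cutter(portion)
    return min(seqs, key=lambda s: sum(edit_distance(cut(s), cut(t)) for t in seqs))


def classify(query, prototypes, portion=None):
    """Label `query` with the class of its nearest prototype (dict: label -> median)."""
    cut = _cutter(portion)
    return min(prototypes,
               key=lambda label: edit_distance(cut(query), cut(prototypes[label])))


group = ["ACGT", "ACGA", "ACGT", "TTTT"]
prototype = set_median(group)                 # median over full sequences
cheap_prototype = set_median(group, portion=2)  # median over 2-symbol prefixes only
```

Restricting the comparison to a prefix shortens every distance computation, which is the kind of time and space saving the abstract reports; whether accuracy holds up depends on the data.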

2.
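Zhang Hao, Chen Lifei, Guo Gongde. 《计算机科学》 (Computer Science), 2015, 42(5): 114-118, 141
A symbolic sequence consists of a finite set of symbols arranged in a particular order. Such sequences arise in many data mining applications, for example gene sequences, protein sequences and speech sequences. As a major approach to sequence mining, sequence clustering is valuable for uncovering the intrinsic structure of sequence data; at the same time, because similarity between symbolic sequences is difficult to measure, sequence clustering remains an open problem. This paper first proposes a new similarity measure for symbolic sequences that introduces a length normalization factor to address the sensitivity of existing measures to sequence length, thereby improving the effectiveness of the similarity measure. On this basis, a new clustering method is proposed that builds an acyclic connected graph from the sample similarities and performs hierarchical clustering of the symbolic sequences by graph partitioning. Experimental results on several real-world data sets show that the new method, using the normalized measure, effectively improves the clustering accuracy for symbolic sequences.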

3.
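Tang Dongming, Zhu Qingxin, Yang Fan, Chen Ke. 《软件学报》 (Journal of Software), 2011, 22(8): 1827-1837
An effective protein sequence clustering method based on the affinity propagation clustering algorithm and a post-processing step is proposed. When clustering protein sequences, post-processing is used to refine the output of affinity propagation and improve the quality of the clustering. To measure similarity between protein sequences, an improved alignment-free computation method is given. Comparative experiments on six protein sequence data sets show that the proposed method can analyse protein sequences effectively.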

4.
Gene clustering is one of the most important problems in bioinformatics. In sequential data clustering, hidden Markov models (HMMs) have been widely used to find similarity between sequences because they can handle sequence patterns of various lengths. In this paper, a novel gene clustering scheme based on HMMs optimized by a particle swarm optimization algorithm is introduced. In this approach, each gene sequence is described by a specific HMM, and then, for each model, its probability of generating each individual sequence is evaluated. A hierarchical clustering algorithm based on a new definition of a distance measure is applied to find the best clusters. Experiments carried out on a dataset of lung cancer-related genes show that the proposed approach can be successfully utilized for gene clustering.

5.
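To address the difficulty of defining a representation model and an inter-sequence distance measure for symbolic sequence clustering, a representation model based on probability vectors and a symbolic sequence clustering algorithm built on that model are proposed. The model represents a symbolic sequence by a probability distribution and defines both a sequence distance measure based on differences between probability distributions and an objective function for the model. A K-means-type clustering algorithm for symbolic sequences is then given and validated experimentally on real sequence sets from different application domains. The results show that, compared with symbolic sequence clustering algorithms based on subsequence representation models, the proposed method effectively improves clustering accuracy while reducing clustering time by more than 50% on real data containing many symbols, such as DNA sequences and speech sequences.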

6.
Similarity measurement of contents plays an important role in TV personalization, e.g., TV content group recommendation and similar TV content retrieval, which are essentially content clustering and example-based retrieval. We define similar TV contents to be those with similar semantic information, e.g., plot, background, genre, etc. Several similarity measurement methods, notably schemes based on the vector space model and on the category hierarchy model, have been proposed for data clustering and example-based retrieval. Each method has its own advantages and shortcomings for measuring TV content similarity. In this paper, we propose a hybrid approach to TV content similarity measurement that combines the vector space model and the category hierarchy model. The hybrid measure makes the most of TV metadata information and takes advantage of both similarity measurements. It measures TV content similarity at the semantic level rather than the physical level. Furthermore, we propose an adaptive strategy for setting the combination parameters. The experimental results show that the hybrid similarity measure is superior to either measure alone for TV content clustering and example-based retrieval.

7.
Efficient Phrase-Based Document Similarity for Clustering   (Cited by: 1 in total; self-citations: 0; citations by others: 1)
In this paper, we propose a phrase-based document similarity to compute the pair-wise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme when computing document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses that of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially on large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.

8.
Text document clustering using global term context vectors   (Cited by: 2 in total; self-citations: 2; citations by others: 0)
Despite the advantages of the traditional vector space model (VSM) representation, there are known deficiencies concerning the term independence assumption. The high dimensionality and sparsity of the text feature space and phenomena such as polysemy and synonymy can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM in which: (i) local contextual information is captured for each term occurrence in the term sequences of documents; (ii) the local contexts for the occurrences of a term are combined to define the global context of that term; (iii) a proper semantic matrix is constructed from the global contexts of all terms; (iv) this matrix is further used to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a 'semantically smoothed' feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared with traditional VSM-based approaches.

9.
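Li Hailin, Liang Ye. 《智能系统学报》 (CAAI Transactions on Intelligent Systems), 2019, 14(2): 288-295
When time series clustering is used to hedge with stock index futures, the key is to choose a suitable clustering method. This paper approaches the problem from a new angle, aiming to improve the performance of time series clustering in financial data analysis, and proposes a stock index futures hedging model based on label propagation time series clustering. The model uses dynamic time warping as the similarity measure to build the network structure of the spot stocks, treats each stock as a node, and uses label propagation to assign nodes to clusters, which yields a clustering of the stock data. In addition, a minimum tracking error optimization model is constructed to determine the optimal weight of each stock in the spot portfolio and hence the optimal portfolio. Experiments compare the tracking error of spot portfolios determined by the new method and by traditional clustering methods. The results show that the new method improves the tracking accuracy of the spot portfolio and offers a new line of research for enriching investment and management approaches in financial markets.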
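Dynamic time warping, the similarity measure named in this abstract, can be sketched with the textbook dynamic programme below. This is a generic illustration and not the paper's implementation: the absolute difference as local cost, the absence of a warping window, and the function name `dtw_distance` are all assumptions.

```python
import math


def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance between two series."""
    n, m = len(x), len(y)
    # cost[i][j] = cheapest cumulative cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # x advances alone
                                     cost[i][j - 1],      # y advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Two price-like series with the same shape but different pacing align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Pairwise distances of this kind could then be thresholded or ranked to decide which stocks are linked in the network on which label propagation runs; how the paper builds those links is not stated in the abstract.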

10.
We introduced a spectral clustering algorithm based on the bipartite graph model for the Manufacturing Cell Formation problem in an earlier paper (Oliveira S, Ribeiro JFF, Seok SC. A spectral clustering algorithm for manufacturing cell formation. Computers and Industrial Engineering, 2007, submitted for publication). It constructs two similarity matrices, one for parts and one for machines, and runs a spectral clustering algorithm on each separately to find families of parts and cells of machines. The similarity measure in that approach utilized limited information between parts and between machines. This paper reviews several well-known similarity measures that have been used for Group Technology. Computational clustering results are compared using various performance measures.

11.
A novel ant-based clustering algorithm using the kernel method   (Cited by: 1 in total; self-citations: 0; citations by others: 1)
A novel ant-based clustering algorithm integrated with the kernel (ACK) method is proposed. There are two aspects to the integration. First, kernel principal component analysis (KPCA) is applied to modify the random projection of objects when the algorithm is initially run. This projection can create rough clusters and improve the algorithm's efficiency. Second, ant-based clustering is performed in the feature space rather than in the input space. The distance between objects in the feature space, which is calculated from the kernel function of the object vectors in the input space, is used as the similarity measure. The algorithm uses an ant movement model in which each object is viewed as an ant. The ant determines its movement according to the fitness of its local neighbourhood. The proposed algorithm incorporates the merits of kernel-based clustering into ant-based clustering. Comparisons with other classic algorithms on several synthetic and real datasets demonstrate that the ACK method exhibits high performance in terms of efficiency and clustering quality.
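The feature-space distance mentioned in this abstract follows from the standard kernel identity ||φ(x) − φ(y)||² = k(x, x) − 2·k(x, y) + k(y, y), which needs only kernel evaluations in the input space. The sketch below illustrates that identity; the Gaussian (RBF) kernel, its `gamma` parameter and the function names are assumptions, since the abstract does not say which kernel ACK uses.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel (assumed choice; any positive semi-definite kernel would do)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance between the images of x and y in the kernel-induced feature space."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negatives caused by rounding


near = feature_space_distance((0.0, 0.0), (0.1, 0.0))
far = feature_space_distance((0.0, 0.0), (3.0, 4.0))
```

With a linear kernel (the plain dot product) the same function reduces to the ordinary Euclidean distance, which is a convenient sanity check.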

12.
Document Similarity Using a Phrase Indexing Graph Model   (Cited by: 3 in total; self-citations: 1; citations by others: 2)
Document clustering techniques mostly rely on single-term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in a document as well as single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.

13.
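Xu Kunpeng, Chen Lifei, Sun Haojun, Wang Beizhan. 《软件学报》 (Journal of Software), 2020, 31(11): 3492-3505
Most existing subspace clustering methods for categorical data assume that features are mutually independent and ignore linear or nonlinear correlations between attributes. A kernel subspace clustering method for categorical data is proposed. First, kernel functions originally designed for continuous data are introduced to project categorical data into a kernel space, and a feature-weighted similarity measure for categorical data in the kernel space is defined. Second, an objective function for kernel subspace clustering of categorical data is derived from this measure, and an efficient optimization method for solving it is proposed. Finally, a kernel subspace clustering algorithm for categorical data is defined. The algorithm not only takes relationships between attributes into account in the nonlinear space, but also assigns each attribute a feature weight during clustering that measures its relevance to each cluster, achieving embedded feature selection for categorical attributes. A cluster validity index is also defined to evaluate the quality of categorical data clustering results. Experimental results on synthetic and real data sets show that, compared with existing subspace clustering algorithms, the kernel subspace clustering algorithm can uncover nonlinear relationships between categorical attributes and effectively improves clustering quality.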

14.
Many activities in business process management, such as process retrieval, process mining, and process integration, need to determine the similarity or the distance between two processes. Although several approaches have recently been proposed to measure the similarity between business processes, neither the definitions of the notion of similarity between processes nor the measurement methods have gained wide recognition. In this paper, we define similarity and distance based on firing sequences in the context of workflow nets (WF-nets) as the unified reference concepts. However, for many WF-nets, either the number of full firing sequences or the length of a single firing sequence is infinite. Since transition adjacency relations (TARs) can be seen as the genes of the firing sequences, describing the transition orders that appear in all possible firing sequences, we propose a practical similarity definition based on the TAR sets of two processes. It is formally shown that the corresponding distance measure between processes is a metric. An algorithm using model reduction techniques for the efficient computation of the measure is also presented. Experimental results involving a comparison of different measures on artificial processes and evaluations on clustering real-life processes validate our approach.
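A TAR-set comparison can be sketched as follows. The abstract does not give the formula, so two assumptions are made and should be read as such: the TAR set is collected from a finite list of example firing sequences rather than derived from the WF-net itself, and similarity is taken to be the Jaccard ratio of the two TAR sets (whose complement is a metric, consistent with the metric property the abstract claims). All function names are hypothetical.

```python
def tar_set(firing_sequences):
    """Collect transition adjacency relations: ordered pairs (a, b) where b directly follows a."""
    return {(a, b) for seq in firing_sequences for a, b in zip(seq, seq[1:])}


def tar_similarity(tars_a, tars_b):
    """Assumed Jaccard similarity: shared adjacencies over all adjacencies."""
    union = tars_a | tars_b
    if not union:
        return 1.0  # two processes with no adjacencies are treated as identical
    return len(tars_a & tars_b) / len(union)


def tar_distance(tars_a, tars_b):
    """Distance as the complement of the similarity."""
    return 1.0 - tar_similarity(tars_a, tars_b)


# Process 1 always runs A, B, C, D; process 2 may skip C.
process_1 = tar_set([["A", "B", "C", "D"]])
process_2 = tar_set([["A", "B", "C", "D"], ["A", "B", "D"]])
similarity = tar_similarity(process_1, process_2)
```

Because a TAR set is finite even when a net has infinitely many or infinitely long firing sequences, a set-based comparison of this kind stays computable, which is the practical motivation the abstract gives.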

15.
Segmentation of a document image plays an important role in automatic document processing. In this paper, we propose a consensus-based clustering approach for document image segmentation. In this method, the foreground regions of a document image are grouped into a set of primitive blocks, and a set of features is extracted from them. Similarities among the blocks are computed on each feature using a hypothesis test-based similarity measure. Based on the consensus of these similarities, clustering is performed on the primitive blocks. This clustering approach is used iteratively with a classifier to label each primitive block. Experimental results show the effectiveness of the proposed method. It is further shown in the experimental results that the dependency of classification performance on the training data is significantly reduced.

16.
Many parameters may affect the navigation behaviour of web users. Predicting the next page that a web user may visit is important, since this information can be used for prefetching or for personalizing the page for that user. One of the successful methods for determining the next web page is to construct behaviour models of the users by clustering. The success of clustering is highly correlated with the similarity measure used to calculate the similarity among navigation sequences. This work proposes a new approach for determining the next web page by extending standard clustering with a content-based semantic similarity method. The semantics of web pages are represented as sets of concepts, and thus user sessions are modelled as sequences of sets. As a result, session similarity is defined as an alignment of two sequences of sets. The success of the proposed method has been shown by applying it to real-life web log data.

17.
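Similarity measurement is an essential foundation of cluster analysis, and measuring similarity between categorical symbols effectively is one of its difficulties. In this paper, similarity between symbols is measured by the kernel probability density of discrete symbols. Unlike traditional simple symbol matching and symbol frequency estimation, this measure, through the action of the kernel bandwidth, no longer relies on the assumption that symbols of the same attribute are independent. A Bayesian clustering model for categorical data is then established, a likelihood-based similarity measure between categorical objects and clusters is defined, and a model-based clustering algorithm is given. Using leave-one-out estimation and maximum likelihood estimation, three solution methods are proposed to determine the optimal kernel bandwidth dynamically during clustering. Experiments show that, compared with clustering algorithms using feature weighting or simple matching distance, the proposed algorithm achieves higher clustering accuracy, and the estimated kernel bandwidths are of practical value in applications such as identifying important features.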

18.
Knowledge-based vector space model for text clustering   (Cited by: 5 in total; self-citations: 4; citations by others: 1)
This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are included in representing text documents as a set of vectors. The idea is to calculate the dissimilarity between two documents more effectively so that text clustering results can be enhanced. In this paper, the semantic relationship between two terms is defined by the similarity of the two terms. This similarity is used to re-weight term frequency in the VSM. We consider and study two similarity measures for computing the semantic relationship between two terms, based on two different approaches. The first approach is based on existing ontologies such as WordNet and MeSH. We define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach makes use of text corpora to construct the relationships between terms and then calculate their semantic similarities. Three clustering algorithms, bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm, have been used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that the clustering performance based on the new model was much better than that based on the traditional term-based VSM.
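Re-weighting term frequencies by term similarity can be illustrated with a small sketch. The abstract does not state the re-weighting formula, so the rule used here is an assumption: each term's new weight is its own frequency plus the frequencies of the other terms scaled by their similarity to it. The function name `reweight_tf` and the hand-written similarity table are hypothetical; in the paper the similarities come from an ontology or a corpus.

```python
def reweight_tf(tf, similarity):
    """Spread term frequency across semantically related terms.

    tf:         dict term -> raw frequency in one document (zero counts allowed)
    similarity: dict (term_a, term_b) -> value in [0, 1], treated as symmetric
    """
    def sim(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    # Assumed rule: new_tf[t] = tf[t] + sum over other terms u of sim(t, u) * tf[u]
    return {t: tf[t] + sum(sim(t, u) * tf[u] for u in tf if u != t) for t in tf}


document = {"car": 2, "automobile": 0, "banana": 1}
smoothed = reweight_tf(document, {("car", "automobile"): 0.8})
```

Under this rule a document that mentions only "car" still receives weight on "automobile", so two documents using different synonyms are no longer orthogonal in the vector space.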

19.
This paper investigates facial image clustering, primarily for movie video content analysis with respect to actor appearance. Our aim is to use a novel formulation of mutual information as a facial image similarity criterion and, by using spectral graph analysis, to cluster a similarity matrix containing the mutual information of facial images. To this end, we use the HSV color space of a facial image (more precisely, only the hue and saturation channels) to calculate the mutual information similarity matrix of a set of facial images. We make full use of the symmetries of the similarity matrix to lower the computational complexity of the new mutual information calculation. We treat each row of this matrix as a feature vector describing a facial image, which produces a global similarity criterion for face clustering. To test the proposed method, we conducted two sets of experiments, which produced clustering accuracy of more than 80%. We also compared our algorithm with other clustering approaches, such as the k-means and fuzzy c-means (FCM) algorithms. Finally, to provide a baseline comparison for our approach, we compared the proposed global similarity measure with another one recently reported in the literature.
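For orientation, the standard discrete mutual information that such a similarity criterion builds on can be computed as below. This is the textbook definition, not the paper's novel formulation, which the abstract does not spell out. The inputs are assumed to be two equal-length sequences of quantized pixel values (for example hue bins at corresponding positions of two face images), and the name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information in bits between two equal-length sequences of discrete values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum of p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs
    return sum((c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


hue_bins_1 = [0, 0, 1, 1]
hue_bins_2 = [0, 0, 1, 1]   # identical pattern: shares one full bit
hue_bins_3 = [0, 1, 0, 1]   # statistically independent of the first
identical = mutual_information(hue_bins_1, hue_bins_2)
unrelated = mutual_information(hue_bins_1, hue_bins_3)
```

Filling a matrix with such pairwise values gives a symmetric similarity matrix, so only the upper triangle needs to be computed, which is the kind of symmetry saving the abstract refers to.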

20.
Efficient phrase-based document indexing for Web document clustering   (Cited by: 4 in total; self-citations: 0; citations by others: 4)
Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
