Similar Documents
20 similar documents found (search time: 31 ms)
1.
Traditional semi-supervised clustering uses only limited user supervision, in the form of instance seeds for clusters and pairwise instance constraints, to aid unsupervised clustering. However, user supervision can also be provided in alternative forms for document clustering, such as labeling a feature by indicating whether it discriminates among clusters. This article fills that void by enhancing traditional semi-supervised clustering with feature supervision, which asks the user to label discriminating features while defining (labeling) the instance seeds or pairwise instance constraints. Various types of semi-supervised clustering algorithms were explored with feature supervision. Our experimental results on several real-world data sets demonstrate that augmenting instance-level supervision with feature-level supervision can significantly improve document clustering performance.
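The idea can be sketched as seeded k-means with feature weighting: columns the user flags as discriminating are up-weighted before clustering. This is a minimal illustration under assumptions, not the paper's algorithm; the function name, the boost factor, and the seed format are all hypothetical.

```python
import numpy as np

def feature_supervised_kmeans(X, seeds, labeled_features, boost=3.0, iters=20):
    """Seeded k-means with feature supervision: user-flagged discriminating
    feature columns are up-weighted before clustering. `seeds` maps each
    cluster id to the indices of its seed instances."""
    w = np.ones(X.shape[1])
    w[labeled_features] = boost            # boost discriminating features
    Xw = X * w
    centers = np.array([Xw[idx].mean(axis=0) for idx in seeds.values()])
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each (weighted) instance to its nearest center
        d = ((Xw[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for k in range(len(centers)):
            if (assign == k).any():
                centers[k] = Xw[assign == k].mean(axis=0)
    return assign
```

With a feature that discriminates two groups, boosting it pulls borderline instances to the correct side even when other features are uninformative.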

2.
3.
We study the problem of label ranking, a machine learning task that consists of inducing a mapping from instances to rankings over a finite number of labels. Our learning method, referred to as ranking by pairwise comparison (RPC), first induces pairwise order relations (preferences) from suitable training data, using a natural extension of so-called pairwise classification. A ranking is then derived from a set of such relations by means of a ranking procedure. In this paper, we first elaborate on a key advantage of such a decomposition, namely the fact that it allows the learner to adapt to different loss functions without re-training, by using different ranking procedures on the same predicted order relations. In this regard, we distinguish between two types of errors, called, respectively, ranking error and position error. Focusing on the position error, which has received less attention so far, we then propose a ranking procedure called ranking through iterated choice as well as an efficient pairwise implementation thereof. Apart from a theoretical justification of this procedure, we offer empirical evidence in favor of its superior performance as a risk minimizer for the position error.
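The aggregation step of such a pairwise decomposition can be sketched as follows, assuming the pairwise models have already produced preference probabilities; the dictionary format and function name are illustrative, not the authors' implementation. The point of the decomposition is that this procedure can be swapped for another without retraining the pairwise models.

```python
def rank_by_pairwise(prefs, labels):
    """Derive a ranking from pairwise preferences by weighted voting:
    each label's score sums the predicted probabilities of it winning
    its pairwise comparisons. `prefs[(a, b)]` is the predicted
    probability that label a precedes label b."""
    score = {l: 0.0 for l in labels}
    for (a, b), p in prefs.items():
        score[a] += p          # evidence that a precedes b
        score[b] += 1.0 - p    # evidence for the reverse
    return sorted(labels, key=lambda l: -score[l])
```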

4.
Major problems exist in both crisp and fuzzy clustering algorithms. Fuzzy c-means-type algorithms use weights determined by a power m of inverse distances that remains fixed over all iterations and over all clusters, even though smaller clusters should have a larger m. Our method uses a different “distance” for each cluster that changes over the early iterations to fit the clusters. Comparisons show improved results. We also address other perplexing problems in clustering: (i) finding the optimal number K of clusters; (ii) assessing the validity of a given clustering; (iii) preventing the selection of seed vectors as initial prototypes from affecting the clustering; (iv) preventing the order of merging from affecting the clustering; and (v) permitting the clusters to form more natural shapes rather than forcing them into normed balls of the distance function. We employ a relatively large number K of uniformly randomly distributed seeds and then thin them to leave fewer uniformly distributed seeds. Next, the main loop iterates by assigning the feature vectors and computing new fuzzy prototypes. Our fuzzy merging then merges any clusters that are too close to each other. We use a modified Xie-Beni validity measure as the goodness-of-clustering measure for multiple values of K in a user-interaction approach where the user selects two parameters (for eliminating clusters and merging clusters after viewing the results thus far). The algorithm is compared with fuzzy c-means on the iris data and on the Wisconsin breast cancer data.
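The paper's modification of the Xie-Beni measure is not specified here, but the standard Xie-Beni index, compactness of the fuzzy partition divided by minimum cluster separation (lower is better), can be sketched as below; it assumes the membership matrix `U` and the centers have already been computed.

```python
import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """Standard Xie-Beni validity index: fuzzy compactness divided by
    n times the minimum squared separation between cluster centers.
    U[i, k] is the fuzzy membership of sample i in cluster k."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, K)
    compact = (U ** m * d2).sum()
    sep = min(((centers[i] - centers[j]) ** 2).sum()
              for i in range(len(centers))
              for j in range(len(centers)) if i != j)
    return compact / (len(X) * sep)
```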

5.
Label distribution learning (LDL) is a novel learning paradigm for resolving label ambiguity. Most existing LDL methods are designed around complete label information; however, owing to high annotation costs and the limits of annotators' expertise, fully annotated data are hard to obtain, and traditional LDL algorithms degrade on incomplete data. This paper therefore proposes a weakly supervised label distribution learning algorithm that incorporates local ordinal label relations: it preserves the relative order among the labels that are not missing and exploits label correlations to recover the missing ones, improving performance when annotation is incomplete. Extensive experiments on 14 data sets verify the effectiveness of the algorithm.

6.
胡峰  刘鑫  邓维斌  代劲  刘群 《控制与决策》2023,38(6):1753-1760
Partial label learning is a weakly supervised learning framework that tries to identify the single correct label among each sample's multiple candidate labels. Disambiguation, which algorithmically infers the latent ground-truth label, is a key technique in partial label learning. Existing methods usually disambiguate in a single feature space or label space alone, so inaccurate prior knowledge can easily trap the algorithm at a saddle point. To address the problem that, during disambiguation, samples with similar features are easily misled by samples of other classes, this paper defines sample outliers and an outlier graph, and on that basis proposes an outlier-graph-guided disambiguation method for partial label learning. The method builds the outlier graph from differences in the label space, effectively combining similarity in the feature space with dissimilarity in the label space and reducing the risk that outliers pose to the disambiguation process. Experimental results show that, compared with PLKNN, IPAL, SURE, PL-AGGD, SDIM, PL-BLC, and PRODEN, the proposed algorithm performs better among partial label learning methods and achieves good disambiguation.

7.
Provision of training data sets is one of the core requirements for event-based supervised NILM (Non-Intrusive Load Monitoring) algorithms. Due to diversity in appliances’ technologies, in-situ training by users is often required. This process might require continuous user interaction to ensure that a high-quality training data set is provided. Pre-populating a training data set could potentially reduce the need for user-system interaction. In this study, a heuristic unsupervised clustering algorithm is presented and evaluated to enable autonomous partitioning of the appliances’ signature space (i.e. feature space) for applications in electricity consumption disaggregation. The algorithm is based on hierarchical clustering and uses the characteristics of a cluster binary tree to determine the distance threshold for pruning the tree without a priori information. The algorithm determines the partition of a feature space recursively to account for the multi-scale nature of the binary cluster tree. Evaluation of the algorithm was carried out using metrics for accuracy and cluster quality (proposed in this study) on a fully labeled data set that was collected and processed in a real residential setting. The algorithm's performance in accurately partitioning the feature space and the effect of different feature extraction techniques are presented and discussed.
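A toy stand-in for the tree-pruning idea, assuming single linkage and a largest-gap heuristic rather than the paper's specific cluster-binary-tree statistics, might look like this; the key property it shares with the paper's method is that no cluster count or distance threshold is supplied a priori.

```python
import numpy as np

def auto_partition(X):
    """Single-linkage agglomerative clustering pruned at the largest gap
    in the sequence of merge distances (no a priori cluster count)."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = [[i] for i in range(n)]
    history, dists = [], []
    while len(clusters) > 1:
        # closest pair of clusters under single linkage
        best = min(((min(D[a, b] for a in ci for b in cj), i, j)
                    for i, ci in enumerate(clusters)
                    for j, cj in enumerate(clusters) if i < j))
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        dists.append(d)
        history.append([list(c) for c in clusters])
    # cut just before the largest jump in merge distance
    g = int(np.argmax(np.diff(dists))) if len(dists) > 1 else 0
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(history[g]):
        labels[members] = k
    return labels
```

The brute-force pair search is O(n^3) overall and only meant for small demonstrations.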

8.
This paper details a modular, self-contained web search results clustering system that enhances search results by (i) performing clustering on lists of web documents returned by queries to search engines, and (ii) ranking the results and labeling the resulting clusters, using a calculated relevance value as a degree of membership to clusters. In addition, we demonstrate an external evaluation method based on precision for comparing fuzzy clustering techniques, as well as internal measures suitable for working on non-training data. The built-in label generator uses the membership degrees and relevance values to weight the most relevant results more heavily. The membership degrees of documents to fuzzy clusters also facilitate effective detection and removal of overly similar clusters. To achieve this, our transduction-based clustering algorithm (TCA) and its fuzzy counterpart (FTCA) employ a transduction-based relevance model (TRM) to consider local relationships between each web document. Testing on five different real-world and synthetic datasets shows favorable results compared to the established label-based clustering algorithms Suffix Tree Clustering (STC) and Lingo.

9.
Supervised clustering is a new research area that aims to improve unsupervised clustering algorithms by exploiting supervised information. Many clustering algorithms exist today, but an effective supervised cluster adjustment method that can adjust the resulting clusters regardless of the applied clustering algorithm has not yet been presented. In this paper, we propose a new supervised cluster adjustment method that can be applied to any clustering algorithm. Since the adjustment method is based on finding nearest neighbors, a novel exact nearest neighbor search algorithm is also introduced which is significantly faster than the classic one. Several datasets and clustering evaluation metrics are employed to comprehensively examine the effectiveness of the proposed cluster adjustment method and the proposed fast exact nearest neighbor algorithm. The experimental results show that the proposed algorithms are significantly effective in improving clusters and accelerating nearest neighbor searches.
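A generic sketch of nearest-neighbour-based cluster adjustment is below. It uses a brute-force neighbour search, not the paper's fast exact algorithm, and the unanimity rule for when to move a point is an assumption for illustration.

```python
import numpy as np

def adjust_clusters(X, labels, X_sup, y_sup, k=3):
    """Post-hoc adjustment of any clustering's output: move a point to a
    different cluster only when all k of its nearest supervised examples
    unanimously agree on that cluster."""
    out = labels.copy()
    for i, x in enumerate(X):
        nn = ((X_sup - x) ** 2).sum(axis=1).argsort()[:k]   # brute-force kNN
        votes = y_sup[nn]
        if (votes == votes[0]).all() and votes[0] != out[i]:
            out[i] = votes[0]
    return out
```

Because the method only reads the clustering output and the supervised set, it is independent of which clustering algorithm produced `labels`.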

10.
11.
Automatic content-based image categorization is a challenging research topic and has many practical applications. Images are usually represented as bags of feature vectors, and the categorization problem is studied in the Multiple-Instance Learning (MIL) framework. In this paper, we propose a novel learning technique which transforms the MIL problem into a standard supervised learning problem by defining a feature vector for each image bag. Specifically, the feature vectors of the image bags are grouped into clusters and each cluster is given a label. Using these labels, each instance of an image bag can be replaced by a corresponding label to obtain a bag of cluster labels. Data mining can then be employed to uncover common label patterns for each image category. These label patterns are converted into bags of feature vectors; and they are used to transform each image bag in the data set into a feature vector such that each vector element is the distance of the image bag to a distinct pattern bag. With this new image representation, standard supervised learning algorithms can be applied to classify the images into the pre-defined categories. Our experimental results demonstrate the superiority of the proposed technique in categorization accuracy as compared to state-of-the-art methods.
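The first step of this transformation, replacing each instance by its nearest cluster label and summarizing the bag, can be sketched as follows; the normalized histogram summary is a simplification of the distance-to-pattern-bag representation described above, not the paper's exact feature.

```python
import numpy as np

def bags_to_label_histograms(bags, centers):
    """Replace each instance by the label of its nearest cluster center,
    then describe each bag by its normalized cluster-label histogram,
    turning variable-size bags into fixed-length feature vectors."""
    k = len(centers)
    feats = []
    for bag in bags:
        d = ((bag[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                     # bag of cluster labels
        feats.append(np.bincount(labels, minlength=k) / len(bag))
    return np.array(feats)
```

Any standard supervised learner can then be trained on the resulting fixed-length vectors.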

12.
13.
The mean-label semi-supervised support vector machine selects unlabeled samples at random, which lowers image classification accuracy and makes the algorithm unstable. To address this, a semi-supervised support vector machine based on clustered label means is proposed. The algorithm modifies the original penalty term for unlabeled samples: the selected unlabeled samples are clustered, and clustered label means replace the plain label means. Experimental results show that a classifier trained with clustered label means greatly reduces confusion between background and target, improving both classification accuracy and the stability of the algorithm, and is well suited to image classification.

14.
15.
An Improved Semi-Supervised K-Means Clustering Algorithm
The K-means clustering algorithm requires the number of clusters in advance, and randomly chosen initial centers make the clustering results unstable and prone to terminating at a local optimum. This paper proposes an improved K-means algorithm based on semi-supervised learning: a small amount of labeled data is used to build the minimum spanning tree of a graph, which is then split iteratively to obtain the cluster count and initial cluster centers that K-means requires. Experiments on the IRIS data set show that, although different random samples yield different spanning trees and different cluster centers, the clustering is consistent and stable and requires few iterations, verifying the effectiveness of the algorithm.
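The MST-based initialisation can be sketched as follows, assuming Prim's algorithm over Euclidean distances; as a simplification of the iterative splitting, the K-1 longest MST edges are simply removed and the component means become the initial centers.

```python
import numpy as np

def mst_seeds(L, K):
    """Build an MST over labeled points L (Prim's algorithm), drop the
    K-1 longest edges, and return the K component means as initial
    cluster centers for K-means."""
    n = len(L)
    D = np.sqrt(((L[:, None, :] - L[None, :, :]) ** 2).sum(-1))
    in_tree, edges = [0], []
    while len(in_tree) < n:
        best = min(((D[i, j], i, j) for i in in_tree
                    for j in range(n) if j not in in_tree))
        edges.append(best)
        in_tree.append(best[2])
    edges.sort()
    keep = edges[: n - K]              # dropping K-1 longest edges
    parent = list(range(n))            # union-find over kept edges
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for _, i, j in keep:
        parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return np.array([L[[i for i in range(n) if find(i) == r]].mean(axis=0)
                     for r in roots])
```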

16.
Existing multi-label feature selection methods have two shortcomings: they ignore the influence of noise when learning label correlations, and they do not exploit the aggregate label information of each cluster. This paper proposes a multi-label feature selection method that strengthens the learning of label correlations. First, the samples are clustered and each cluster center is treated as a representative instance that aggregates the semantic information of its samples; the corresponding label vector, which reflects how strongly each label is represented in the cluster, is also computed. Second, label-level self-representation of both the original samples and the cluster centers captures label correlations in the original label space as well as within each cluster. Finally, the self-representation coefficient matrix is sparsified to reduce the influence of noise, and the original samples and the representative instances are each mapped from the feature space to the reconstructed label space for feature selection. Experiments on nine multi-label data sets show that the proposed algorithm outperforms the comparison methods.

17.
Distance to second cluster as a measure of classification confidence
Most image classification algorithms rely on computing the distance between the unique spectral signature of a given pixel and a set of possible clusters within an n-dimensional feature space that represents discrete land cover categories. Each scrutinized pixel will ultimately be closest to one of the predefined clusters; classification algorithms differ in the details of which cluster is considered closest or most likely, but in general the selected algorithm will label each pixel with the label of the closest cluster. However, pixels expressing virtually identical distances to two or more clusters expose a limitation of this typical classification approach. Such limitations arise when distances are long and the pixel may not clearly belong to any single category, may represent mixed land cover, or can easily be confused spectrally between two or more categories. We propose that retaining the distance to the second-closest cluster can shed light on the confidence with which label assignment proceeds, and we present several examples of how such additional information might enhance accuracy assessments and improve classification confidence. The method was developed with simplicity as a goal, assuming the classification has already been performed and standard clustering reports are available. Over a test site in central British Columbia, Canada, we illustrate the described technique using classified image data from a nation-wide land cover mapping project. Calculation of multi-spectral Euclidean distances to cluster centroids, standardized by cluster variance, allows comparison of all potential class assignments within a unified framework. The resulting distances provide a measure of relative confidence in the actual classification at the level of individual pixels.
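A minimal sketch of the standardized-distance comparison, assuming per-cluster scalar variances and Euclidean distance; the ratio of nearest to second-nearest distance is one plausible confidence summary, not necessarily the authors' exact formulation.

```python
import numpy as np

def classification_confidence(pixels, centroids, variances):
    """Ratio of variance-standardized distances to the nearest and
    second-nearest cluster centroids. Values near 0 mean a clear winner
    (high confidence); values near 1 mean the two closest classes are
    nearly tied (low confidence)."""
    d = np.sqrt(((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(-1))
    d = d / np.sqrt(variances)         # standardize by each cluster's variance
    d.sort(axis=1)                     # column 0: nearest, column 1: second
    return d[:, 0] / d[:, 1]
```

Thresholding this ratio flags pixels that are spectrally ambiguous between two categories for closer accuracy assessment.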

18.
To cluster web documents, all of which contain the same named entities, we attempted to use existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out not to be effective for clustering web documents. According to our intensive investigation, clustering such web pages is more complicated because (1) the number of clusters (the ground truth) is larger than the two or three clusters of typical clustering problems, and (2) the clusters in the data set have extremely skewed size distributions. To overcome these problems, in this paper we propose an effective clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves performance by approximately 56% over spectral bisection and 36% over K-means.
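The normalized cut that scores such bisection and merge decisions can be computed, for a given bisection mask over the similarity graph W, as below (a standard definition, shown here only to make the criterion concrete):

```python
import numpy as np

def normalized_cut(W, mask):
    """Normalized-cut value of a bisection of similarity graph W:
    cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where `mask` marks
    the vertices in side A. Lower values mean a better separation."""
    cut = W[mask][:, ~mask].sum()      # weight crossing the bisection
    assoc_a = W[mask].sum()            # total weight touching side A
    assoc_b = W[~mask].sum()           # total weight touching side B
    return cut / assoc_a + cut / assoc_b
```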

19.
Modern-day computers cannot provide an optimal solution to the clustering problem. Many clustering algorithms attempt to approximate the optimal solution, and they can be broadly classified into two categories. Techniques in the first category directly assign objects to clusters and then analyze the resulting clusters. Methods in the second category adjust representations of clusters and then determine the object assignments. In terms of disciplines, these techniques can be classified as statistical, genetic-algorithm based, and neural-network based. This paper reports the results of experiments comparing five approaches: hierarchical grouping, object-based genetic algorithms, cluster-based genetic algorithms, Kohonen neural networks, and the K-means method. The comparisons cover time requirements and within-group errors. The theoretical analyses were tested on clustering of highway sections and of supermarket customers. All the techniques were applied to clustering highway sections. The hierarchical grouping and genetic algorithm approaches were computationally infeasible for clustering the larger set of supermarket customers, so only the Kohonen neural network and K-means techniques were applied to the second set to confirm some of the results from the earlier experiments.

20.
《Knowledge》2006,19(1):77-83
Ensemble methods that train multiple learners and then combine their predictions have been shown to be very effective in supervised learning. This paper explores ensemble methods for unsupervised learning. Here, an ensemble comprises multiple clusterers, each trained by the k-means algorithm with different initial points. The clusters discovered by different clusterers are aligned, i.e. similar clusters are assigned the same label, by counting their overlapping data items. Then, four methods are developed to combine the aligned clusterers. Experiments show that clustering performance can be significantly improved by ensemble methods, and that using mutual information to select a subset of clusterers for weighted voting is a good choice. Since the proposed methods work by analyzing the clustering results instead of the internal mechanisms of the component clusterers, they are applicable to diverse kinds of clustering algorithms.
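Alignment by overlap counting followed by majority voting can be sketched as below; it brute-forces label permutations, which is practical only for small k, and the paper's four combination methods (including mutual-information weighting) are not reproduced here.

```python
import numpy as np
from itertools import permutations

def align(reference, other, k):
    """Relabel clustering `other` so its clusters maximally overlap the
    clusters of `reference` (brute force over label permutations)."""
    best, best_labels = -1, other
    for perm in permutations(range(k)):
        relabeled = np.array([perm[c] for c in other])
        overlap = int((relabeled == reference).sum())
        if overlap > best:
            best, best_labels = overlap, relabeled
    return best_labels

def ensemble_vote(clusterings, k):
    """Align every clusterer to the first one, then take a per-item
    majority vote over the aligned labels."""
    ref = clusterings[0]
    aligned = np.stack([ref] + [align(ref, c, k) for c in clusterings[1:]])
    return np.array([np.bincount(aligned[:, i], minlength=k).argmax()
                     for i in range(aligned.shape[1])])
```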


Copyright©北京勤云科技发展有限公司  京ICP备09084417号