首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Semisupervised learning from different information sources   总被引:1,自引:1,他引:1  
This paper studies the use of a semisupervised learning algorithm from different information sources. We first offer a theoretical explanation as to why minimising the disagreement between individual models could lead to the performance improvement. Based on the observation, this paper proposes a semisupervised learning approach that attempts to minimise this disagreement by employing a co-updating method and making use of both labeled and unlabeled data. Three experiments to test the effectiveness of the approach are presented in this paper: (i) webpage classification from both content and hyperlinks; (ii) functional classification of gene using gene expression data and phylogenetic data and (iii) machine self-maintaining from both sensory and image data. The results show the effectiveness and efficiency of our approach and suggest its application potentials.  相似文献   

2.
Recently, we have developed the hierarchical generative topographic mapping (HGTM), an interactive method for visualization of large high-dimensional real-valued data sets. We propose a more general visualization system by extending HGTM in three ways, which allows the user to visualize a wider range of data sets and better support the model development process. 1) We integrate HGTM with noise models from the exponential family of distributions. The basic building block is the latent trait model (LTM). This enables us to visualize data of inherently discrete nature, e.g., collections of documents, in a hierarchical manner. 2) We give the user a choice of initializing the child plots of the current plot in either interactive, or automatic mode. In the interactive mode, the user selects "regions of interest", whereas in the automatic mode, an unsupervised minimum message length (MML)-inspired construction of a mixture of LTMs is employed. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualizing large data sets. 3) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool to improve our understanding of the visualization plots, since they can highlight the boundaries between data clusters. We illustrate our approach on a toy example and evaluate it on three more complex real data sets.  相似文献   

3.
High-dimensional data visualization is a more complex process than the ordinary dimensionality reduction to two or three dimensions. Therefore, we propose and evaluate a novel four-step visualization approach that is built upon the combination of three components: metric learning, intrinsic dimensionality estimation, and feature extraction. Although many successful applications of dimensionality reduction techniques for visualization are known, we believe that the sophisticated nature of high-dimensional data often needs a combination of several machine learning methods to solve the task. Here, this is provided by a novel framework and experiments with real-world data.  相似文献   

4.
Binary segmentation procedures (in particular, classification and regression trees) are extended to study the relation between dissimilarity data and a set of explanatory variables. The proposed split criterion is very flexible, and can be applied to a wide range of data (e.g., mixed types of multiple responses, longitudinal data, sequence data). Also, it can be shown to be an extension of well-established criteria introduced in the literature on binary trees.  相似文献   

5.
This paper presents a new technique for clustering either object or relational data. First, the data are represented as a matrix D of dissimilarity values. D is reordered to D * using a visual assessment of cluster tendency algorithm. If the data contain clusters, they are suggested by visually apparent dark squares arrayed along the main diagonal of an image I( D *) of D *. The suggested clusters in the object set underlying the reordered relational data are found by defining an objective function that recognizes this blocky structure in the reordered data. The objective function is optimized when the boundaries in I( D *) are matched by those in an aligned partition of the objects. The objective function combines measures of contrast and edginess and is optimized by particle swarm optimization. We prove that the set of aligned partitions is exponentially smaller than the set of partitions that needs to be searched if clusters are sought in D . Six numerical examples are given to illustrate various facets of the algorithm. © 2009 Wiley Periodicals, Inc.  相似文献   

6.
Multimedia Tools and Applications - One of the main challenges in hierarchical object classification is the derivation of the correct hierarchical structure. The classic way around the problem is...  相似文献   

7.
8.
The problem of clustering with side information has received much recent attention and metric learning has been considered as a powerful approach to this problem. Until now, various metric learning methods have been proposed for semi-supervised clustering. Although some of the existing methods can use both positive (must-link) and negative (cannot-link) constraints, they are usually limited to learning a linear transformation (i.e., finding a global Mahalanobis metric). In this paper, we propose a framework for learning linear and non-linear transformations efficiently. We use both positive and negative constraints and also the intrinsic topological structure of data. We formulate our metric learning method as an appropriate optimization problem and find the global optimum of this problem. The proposed non-linear method can be considered as an efficient kernel learning method that yields an explicit non-linear transformation and thus shows out-of-sample generalization ability. Experimental results on synthetic and real-world data sets show the effectiveness of our metric learning method for semi-supervised clustering tasks.  相似文献   

9.
In this paper, we propose a novel method to measure the dissimilarity of categorical data. The key idea is to consider the dissimilarity between two categorical values of an attribute as a combination of dissimilarities between the conditional probability distributions of other attributes given these two values. Experiments with real data show that our dissimilarity estimation method improves the accuracy of the popular nearest neighbor classifier.  相似文献   

10.
Pairwise dissimilarity representations are frequently used as an alternative to feature vectors in pattern recognition. One of the problems encountered in the analysis of such data is that the dissimilarities are rarely Euclidean, while statistical learning algorithms often rely on Euclidean dissimilarities. Such non-Euclidean dissimilarities are often corrected or a consistent Euclidean geometry is imposed on them via embedding. This paper commences by reviewing the available algorithms for analysing non-Euclidean dissimilarity data. The novel contribution is to show how the Ricci flow can be used to embed and rectify non-Euclidean dissimilarity data. According to our representation, the data is distributed over a manifold consisting of patches. Each patch has a locally uniform curvature, and this curvature is iteratively modified by the Ricci flow. The raw dissimilarities are the geodesic distances on the manifold. Rectified Euclidean dissimilarities are obtained using the Ricci flow to flatten the curved manifold by modifying the individual patch curvatures. We use two algorithmic components to implement this idea. Firstly, we apply the Ricci flow independently to a set of surface patches that cover the manifold. Second, we use curvature regularisation to impose consistency on the curvatures of the arrangement of different surface patches. We perform experiments on three real world datasets, and use these to determine the importance of the different algorithmic components, i.e. Ricci flow and curvature regularisation. We conclude that curvature regularisation is an essential step needed to control the stability of the piecewise arrangement of patches under the Ricci flow.  相似文献   

11.
王林  郭娜娜 《计算机应用》2017,37(4):1032-1037
针对传统分类技术对不均衡电信客户数据集中流失客户识别能力不足的问题,提出一种基于差异度的改进型不均衡数据分类(IDBC)算法。该算法在基于差异度分类(DBC)算法的基础上改进了原型选择策略。在原型选择阶段,利用改进型的样本子集优化方法从整体数据集中选择最具参考价值的原型集,从而避免了随机选择所带来的不确定性;在分类阶段,分别利用训练集和原型集、测试集和原型集样本之间的差异性构建相应的特征空间,进而采用传统的分类预测算法对映射到相应特征空间内的差异度数据集进行学习。最后选用了UCI数据库中的电信客户数据集和另外6个普通的不均衡数据集对该算法进行验证,相对于传统基于特征的不均衡数据分类算法,DBC算法对稀有类的识别率平均提高了8.3%,IDBC算法对稀有类的识别率平均提高了11.3%。实验结果表明,所提IDBC算法不受类别分布的影响,而且对不均衡数据集中稀有类的识别能力优于已有的先进分类技术。  相似文献   

12.
One of the critical aspects of clustering algorithms is the correct identification of the dissimilarity measure used to drive the partitioning of the data set. The dissimilarity measure induces the cluster shape and therefore determines the success of clustering algorithms. As cluster shapes change from a data set to another, dissimilarity measures should be extracted from data. To this aim, we exploit some pairs of points with known dissimilarity value to teach a dissimilarity relation to a feed-forward neural network. Then, we use the neural dissimilarity measure to guide an unsupervised relational clustering algorithm. Experiments on synthetic data sets and on the Iris data set show that the relational clustering algorithm based on the neural dissimilarity outperforms some popular clustering algorithms (with possible partial supervision) based on spatial dissimilarity.  相似文献   

13.
Context plays an important role when performing classification, and in this paper we examine context from two perspectives. First, the classification of items within a single task is placed within the context of distinct concurrent or previous classification tasks (multiple distinct data collections). This is referred to as multi-task learning (MTL), and is implemented here in a statistical manner, using a simplified form of the Dirichlet process. In addition, when performing many classification tasks one has simultaneous access to all unlabeled data that must be classified, and therefore there is an opportunity to place the classification of any one feature vector within the context of all unlabeled feature vectors; this is referred to as semi-supervised learning. In this paper we integrate MTL and semi-supervised learning into a single framework, thereby exploiting two forms of contextual information. Example results are presented on a "toy" example, to demonstrate the concept, and the algorithm is also applied to three real data sets.  相似文献   

14.
Gold?s original paper on inductive inference introduced a notion of an optimal learner. Intuitively, a learner identifies a class of objects optimally iff there is no other learner that: requires as little of each presentation of each object in the class in order to identify that object, and, for some presentation of some object in the class, requires less of that presentation in order to identify that object. Beick considered this notion in the context of function learning, and gave an intuitive characterization of an optimal function learner. Jantke and Beick subsequently characterized the classes of functions that are algorithmically, optimally identifiable.Herein, Gold?s notion is considered in the context of language learning. It is shown that a characterization of optimal language learners analogous to Beick?s does not hold. It is also shown that the classes of languages that are algorithmically, optimally identifiable cannot be characterized in a manner analogous to that of Jantke and Beick.Other interesting results concerning optimal language learning include the following. It is shown that strong non-U-shapedness, a property involved in Beick?s characterization of optimal function learners, does not restrict algorithmic language learning power. It is also shown that, for an arbitrary optimal learner F of a class of languages L, F optimally identifies a subclass K of L iff F is class-preserving with respect to K.  相似文献   

15.
Clustering is to group similar data and find out hidden information about the characteristics of dataset for the further analysis. The concept of dissimilarity of objects is a decisive factor for good quality of results in clustering. When attributes of data are not just numerical but categorical and high dimensional, it is not simple to discriminate the dissimilarity of objects which have synonymous values or unimportant attributes. We suggest a method to quantify the level of difference between categorical values and to weigh the implicit influence of each attribute on constructing a particular cluster. Our method exploits distributional information of data correlated with each categorical value so that intrinsic relationship of values can be discovered. In addition, it measures significance of each attribute in constructing respective cluster dynamically. Experiments on real datasets show the propriety and effectiveness of the method, which improves the results considerably even with simple clustering algorithms. Our approach does not couple with a clustering algorithm tightly and can also be applied to various algorithms flexibly.  相似文献   

16.
Statistical pattern recognition traditionally relies on feature-based representation. For many applications, such vector representation is not available and we only possess proximity data (distance, dissimilarity, similarity, ranks, etc.). In this paper, we consider a particular point of view on discriminant analysis from dissimilarity data. Our approach is inspired by the Gaussian classifier and we defined decision rules to mimic the behavior of a linear or a quadratic classifier. The number of parameters is limited (two per class). Numerical experiments on artificial and real data show interesting behavior compared to Support Vector Machines and to kNN classifier: (a) lower or equivalent error rate, (b) equivalent CPU time, (c) more robustness with sparse dissimilarity data.  相似文献   

17.
提出了相异度导引的有监督鉴别分析方法(D-SDA)。结合模式局部信息和全局信息,定义了类内散度权重矩阵[RW]和类间散度权重矩阵[RB],分别表示类内样本的相异度、类间样本的相异度。由[RW]、[RB]导出类内散度矩阵[SW]和类间散度矩阵[SB],根据Fisher鉴别准则函数确定最优变换矩阵。在YALE和AR人脸图像库上的实验验证了这一算法的有效性。  相似文献   

18.
提出度量多个集合之间总体差异程度的拓展集合差异度及相关定理,并给出一种新的解决分类属性高维数据聚类问题的CAESD算法。基于拓展集合差异度及拓展集合特征向量,在CABOSFV_C聚类的基础上通过两阶段聚类完成全部聚类过程。采用UCI数据集与K-modes及其改进算法、CABOSFV_C算法进行比较实验,结果表明CAESD算法具有较高的聚类正确率。  相似文献   

19.
This paper presents a method for designing semi-supervised classifiers trained on labeled and unlabeled samples. We focus on probabilistic semi-supervised classifier design for multi-class and single-labeled classification problems, and propose a hybrid approach that takes advantage of generative and discriminative approaches. In our approach, we first consider a generative model trained by using labeled samples and introduce a bias correction model, where these models belong to the same model family, but have different parameters. Then, we construct a hybrid classifier by combining these models based on the maximum entropy principle. To enable us to apply our hybrid approach to text classification problems, we employed naive Bayes models as the generative and bias correction models. Our experimental results for four text data sets confirmed that the generalization ability of our hybrid classifier was much improved by using a large number of unlabeled samples for training when there were too few labeled samples to obtain good performance. We also confirmed that our hybrid approach significantly outperformed generative and discriminative approaches when the performance of the generative and discriminative approaches was comparable. Moreover, we examined the performance of our hybrid classifier when the labeled and unlabeled data distributions were different.  相似文献   

20.
Automatic classification is one of the basic tasks required in any pattern recognition and human computer interaction application. In this paper, we discuss training probabilistic classifiers with labeled and unlabeled data. We provide a new analysis that shows under what conditions unlabeled data can be used in learning to improve classification performance. We also show that, if the conditions are violated, using unlabeled data can be detrimental to classification performance. We discuss the implications of this analysis to a specific type of probabilistic classifiers, Bayesian networks, and propose a new structure learning algorithm that can utilize unlabeled data to improve classification. Finally, we show how the resulting algorithms are successfully employed in two applications related to human-computer interaction and pattern recognition: facial expression recognition and face detection.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号