首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Photo clustering is an effective way to organize albums and it is useful in many applications, such as photo browsing and tagging. But automatic photo clustering is not an easy task due to the large variation of photo content. In this paper, we propose an interactive photo clustering paradigm that jointly explores human and computer. In this paradigm, the photo clustering task is semi-automatically accomplished: users are allowed to manually adjust clustering results with different operations, such as splitting clusters, merging clusters and moving photos from one cluster to another. Behind users’ operations, we have a learning engine that keeps updating the distance measurements between photos in an online way, such that better clustering can be performed based on the distance measure. Experimental results on multiple photo albums demonstrated that our approach is able to improve automatic photo clustering results, and by exploring distance metric learning, our method is much more effective than pure manual adjustments of photo clustering.  相似文献   

2.
以往研究者都从公式的合理性出发研究迁移学习和传统机器学习,但他们忽视了对问题的整体性考虑,致使在具体应用到文本分类问题时,无法实现彻底的分类。通过研究文本分类的整个过程,在k-均值算法中使用余弦距离,显著提高了实验结果;提出保护型迭代思想,同时弃用传统的词特征空间,采用隐空间作为特征向量空间,实施归一化约束。以CCI算法为例,结合提出的改进思想,产生改进算法PCCI,在降低计算复杂度的同时显著提高迁移学习的分类正确率。通过在数据集20-NewsGroups和Reuters-21578上测试并与现有其他迁移学习算法进行比较,证明了该改进算法的优越性。  相似文献   

3.
Separating text lines in unconstrained handwritten documents remains a challenge because the handwritten text lines are often un-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric in supervised learning on a dataset of pairs of CCs, the proposed algorithm is made robust to handle various documents with multi-skewed and curved text lines. In experiments on a database with 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a correct rate 98.02% of line detection, and compared favorably to other competitive algorithms.  相似文献   

4.
5.
Large graphs are scale free and ubiquitous having irregular relationships. Clustering is used to find existent similar patterns in graphs and thus help in getting useful insights. In real-world, nodes may belong to more than one cluster thus, it is essential to analyze fuzzy cluster membership of nodes. Traditional centralized fuzzy clustering algorithms incur high communication cost and produce poor quality of clusters when used for large graphs. Thus, scalable solutions are obligatory to handle huge amount of data in less computational time with minimum disk access. In this paper, we proposed a parallel fuzzy clustering algorithm named ‘PGFC’ for handling scalable graph data. It will be advantageous from the viewpoint of expert systems to develop a clustering algorithm that can assure scalability along with better quality of clusters for handling large graphs.The algorithm is parallelized using bulk synchronous parallel (BSP) based Pregel model. The cluster centers are initialized using degree centrality measure, resulting in lesser number of iterations. The performance of PGFC is compared with other state of art clustering algorithms using synthetic graphs and real world networks. The experimental results reveal that the proposed PGFC scales up linearly to handle large graphs and produces better quality of clusters when compared to other graph clustering counterparts.  相似文献   

6.
Finding reliable subgraphs from large probabilistic graphs   总被引:3,自引:0,他引:3  
Reliable subgraphs can be used, for example, to find and rank nontrivial links between given vertices, to concisely visualize large graphs, or to reduce the size of input for computationally demanding graph algorithms. We propose two new heuristics for solving the most reliable subgraph extraction problem on large, undirected probabilistic graphs. Such a problem is specified by a probabilistic graph G subject to random edge failures, a set of terminal vertices, and an integer K. The objective is to remove K edges from G such that the probability of connecting the terminals in the remaining subgraph is maximized. We provide some technical details and a rough analysis of the proposed algorithms. The practical performance of the methods is evaluated on real probabilistic graphs from the biological domain. The results indicate that the methods scale much better to large input graphs, both computationally and in terms of the quality of the result.  相似文献   

7.
Classic linear dimensionality reduction (LDR) methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), are known not to be robust against outliers. Following a systematic analysis of the multi-class LDR problem in a unified framework, we propose a new algorithm, called minimal distance maximization (MDM), to address the non-robustness issue. The principle behind MDM is to maximize the minimal between-class distance in the output space. MDM is formulated as a semi-definite program (SDP), and its dual problem reveals a close connection to “weighted” LDR methods. A soft version of MDM, in which LDA is subsumed as a special case, is also developed to deal with overlapping centroids. Finally, we drop the homoscedastic Gaussian assumption made in MDM by extending it in a non-parametric way, along with a gradient-based convex approximation algorithm to significantly reduce the complexity of the original SDP. The effectiveness of our proposed methods are validated on two UCI datasets and two face datasets.  相似文献   

8.
Distance metric is a key issue in many machine learning algorithms. This paper considers a general problem of learning from pairwise constraints in the form of must-links and cannot-links. As one kind of side information, a must-link indicates the pair of the two data points must be in a same class, while a cannot-link indicates that the two data points must be in two different classes. Given must-link and cannot-link information, our goal is to learn a Mahalanobis distance metric. Under this metric, we hope the distances of point pairs in must-links are as small as possible and those of point pairs in cannot-links are as large as possible. This task is formulated as a constrained optimization problem, in which the global optimum can be obtained effectively and efficiently. Finally, some applications in data clustering, interactive natural image segmentation and face pose estimation are given in this paper. Experimental results illustrate the effectiveness of our algorithm.  相似文献   

9.
Li  Xiaocui  Yin  Hongzhi  Zhou  Ke  Zhou  Xiaofang 《World Wide Web》2020,23(2):781-798
World Wide Web - As a common technology in social network, clustering has attracted lots of research interest due to its high performance, and many clustering methods have been presented. The most...  相似文献   

10.
Most existing semi-supervised clustering algorithms are not designed for handling high-dimensional data. On the other hand, semi-supervised dimensionality reduction methods may not necessarily improve the clustering performance, due to the fact that the inherent relationship between subspace selection and clustering is ignored. In order to mitigate the above problems, we present a semi-supervised clustering algorithm using adaptive distance metric learning (SCADM) which performs semi-supervised clustering and distance metric learning simultaneously. SCADM applies the clustering results to learn a distance metric and then projects the data onto a low-dimensional space where the separability of the data is maximized. Experimental results on real-world data sets show that the proposed method can effectively deal with high-dimensional data and provides an appealing clustering performance.  相似文献   

11.
The RELIEF algorithm is a popular approach for feature weighting. Many extensions of the RELIEF algorithm are developed, and I-RELIEF is one of the famous extensions. In this paper, I-RELIEF is generalized for supervised distance metric learning to yield a Mahananobis distance function. The proposed approach is justified by showing that the objective function of the generalized I-RELIEF is closely related to the expected leave-one-out nearest-neighbor classification rate. In addition, the relationships among the generalized I-RELIEF, the neighbourhood components analysis, and graph embedding are also pointed out. Experimental results on various data sets all demonstrate the superiority of the proposed approach.  相似文献   

12.
《工矿自动化》2017,(5):22-26
提出了一种基于距离度量学习的煤岩识别方法。该方法首先从煤岩图像训练集中提取煤岩图像特征;然后学习到特定的距离度量,使得煤样本特征间、岩石样本特征间距离变小,煤样本特征与岩石样本特征间距离变大,以提高分类识别效果;最后采用分类器进行煤岩识别。实验结果表明,对于煤岩样本图像的LBP特征、HOG特征、GLCM特征,与基于欧式距离、LDA、ITML的煤岩识别方法相比,该方法具有更高的煤岩识别率。  相似文献   

13.
A new model for supervised classification based on probabilistic decision graphs is introduced. A probabilistic decision graph (PDG) is a graphical model that efficiently captures certain context specific independencies that are not easily represented by other graphical models traditionally used for classification, such as the Naïve Bayes (NB) or Classification Trees (CT). This means that the PDG model can capture some distributions using fewer parameters than classical models. Two approaches for constructing a PDG for classification are proposed. The first is to directly construct the model from a dataset of labelled data, while the second is to transform a previously obtained Bayesian classifier into a PDG model that can then be refined. These two approaches are compared with a wide range of classical approaches to the supervised classification problem on a number of both real world databases and artificially generated data.  相似文献   

14.
Bug report assignment is an important part of software maintenance. In particular, incorrect assignments of bug reports to development teams can be very expensive in large software development projects. Several studies propose automating bug assignment techniques using machine learning in open source software contexts, but no study exists for large-scale proprietary projects in industry. The goal of this study is to evaluate automated bug assignment techniques that are based on machine learning classification. In particular, we study the state-of-the-art ensemble learner Stacked Generalization (SG) that combines several classifiers. We collect more than 50,000 bug reports from five development projects from two companies in different domains. We implement automated bug assignment and evaluate the performance in a set of controlled experiments. We show that SG scales to large scale industrial application and that it outperforms the use of individual classifiers for bug assignment, reaching prediction accuracies from 50 % to 89 % when large training sets are used. In addition, we show how old training data can decrease the prediction accuracy of bug assignment. We advice industry to use SG for bug assignment in proprietary contexts, using at least 2,000 bug reports for training. Finally, we highlight the importance of not solely relying on results from cross-validation when evaluating automated bug assignment.  相似文献   

15.
Graphs are universal modeling tools. They are used to represent objects and their relationships in almost all domains: they are used to represent DNA, images, videos, social networks, XML documents, etc. When objects are represented by graphs, the problem of their comparison is a problem of comparing graphs. Comparing objects is a key task in our daily life. It is the core of a search engine, the backbone of a mining tool, etc. Nowadays, comparing objects faces the challenge of the large amount of data that this task must deal with. Moreover, when graphs are used to model these objects, it is known that graph comparison is very complex and computationally hard especially for large graphs. So, research on simplifying graph comparison gainedan interest and several solutions are proposed. In this paper, we explore and evaluate a new solution for the comparison of large graphs. Our approach relies on a compact encoding of graphs called prime graphs. Prime graphs are smaller and simpler than the original ones but they retain the structure and properties of the encoded graphs. We propose to approximate the similarity between two graphs by comparing the corresponding prime graphs. Simulations results show that this approach is effective for large graphs.  相似文献   

16.
Approximating clusters in very large (VL=unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility—a way to find (approximate) clusters—for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.  相似文献   

17.
Determining a proper distance metric is often a crucial step for machine learning. In this paper, a boosting algorithm is proposed to learn a Mahalanobis distance metric. Similar to most boosting algorithms, the proposed algorithm improves a loss function iteratively. In particular, the loss function is defined in terms of hypothesis margins, and a metric matrix base-learner specific to the boosting framework is also proposed. Experimental results show that the proposed approach can yield effective Mahalanobis distance metrics for a variety of data sets, and demonstrate the feasibility of the proposed approach.  相似文献   

18.
目的 度量学习是机器学习与图像处理中依赖于任务的基础研究问题。由于实际应用背景复杂,在大量不可避免的噪声环境下,度量学习方法的性能受到一定影响。为了降低噪声影响,现有方法常用L1距离取代L2距离,这种方式可以同时减小相似样本和不相似样本的损失尺度,却忽略了噪声对类内和类间样本的不同影响。为此,本文提出了一种非贪婪的鲁棒性度量学习算法——基于L2/L1损失的边缘费歇尔分析(marginal Fisher analysis based on L2/L1 loss,MFA-L2/L1),采用更具判别性的损失,可提升噪声环境下的识别性能。方法 在边缘费歇尔分析(marginal Fisher analysis,MFA)方法的基础上,所提模型采用L2距离刻画相似样本损失、L1距离刻画不相似样本损失,同时加大对两类样本的惩罚程度以提升方法的判别性。首先,针对模型非凸带来的求解困难,将目标函数转为迭代两个凸函数之差便于求解;然后,受DCA(difference of convex functions algorithm)思想启发,推导出非贪婪的迭代求解算法,求得最终度量矩阵;最后,算法的理论证明保证了迭代算法的收敛性。结果 在5个UCI(University of California Irrine)数据集和7个人脸数据集上进行对比实验:1)在不同程度噪声的5个UCI数据集上,MFA-L2/L1算法最优,且具有较好的抗噪性,尤其在30%噪声程度的Seeds和Wine数据集上,与次优方法LDA-NgL1(non-greedy L1-norm linear discriminant analysis))相比,MFA-L2/L1的准确率高出9%;2)在不同维度的AR和FEI人脸数据集上的实验,验证了模型采用L1损失、采用L2损失提升了模型的判别性;3)在Senthil、Yale、ORL、Caltech和UMIST人脸数据集的仿真实验中,MFA-L2/L1算法呈现出较强鲁棒性,性能排名第1。结论 本文提出了一种基于L2/L1损失的鲁棒性度量学习模型,并推导了一种便捷有效的非贪婪式求解算法,进行了算法收敛性的理论分析。在不同数据集的不同噪声情况下的实验结果表明,所提算法具有较好的识别率和鲁棒性。  相似文献   

19.
Efficiently searching top-k representative vertices is crucial for understanding the structure of large dynamic graphs. Recent studies show that communities formed by a vertex with high local clustering coefficient and its neighbours can achieve enhanced information propagation speed as well as disease transmission speed. However, local clustering coefficient, which measures the cliquishness of a vertex in its local neighbourhood, prefers vertices with small degrees. To remedy this issue, in this paper we propose a new ranking measure, weighted clustering coefficient (WCC) of vertices, by integrating both local clustering coefficient and degree. WCC not only inherits the properties of local clustering coefficient but also approximately measures the density (i.e., average degree) of its neighbourhood subgraph. Thus, vertices with higher WCC are more likely to be representative. We study efficiently computing and monitoring top-k representative vertices based on WCC over large dynamic graphs. To reduce the search space, we propose a series of heuristic upper bounds for WCC to prune a large portion of disqualifying vertices from the search space. We also develop an approximation algorithm by utilizing Flajolet-Martin sketch to trade acceptable accuracy for enhanced efficiency. An efficient incremental algorithm dealing with frequent updates in dynamic graphs is explored as well. Extensive experimental results on a variety of real-life graph datasets demonstrate the efficiency and effectiveness of our approaches.  相似文献   

20.
Most existing representative works in semi-supervised clustering do not sufficiently solve the violation problem of pairwise constraints. On the other hand, traditional kernel methods for semi-supervised clustering not only face the problem of manually tuning the kernel parameters due to the fact that no sufficient supervision is provided, but also lack a measure that achieves better effectiveness of clustering. In this paper, we propose an adaptive Semi-supervised Clustering Kernel Method based on Metric learning (SCKMM) to mitigate the above problems. Specifically, we first construct an objective function from pairwise constraints to automatically estimate the parameter of the Gaussian kernel. Then, we use pairwise constraint-based K-means approach to solve the violation issue of constraints and to cluster the data. Furthermore, we introduce metric learning into nonlinear semi-supervised clustering to improve separability of the data for clustering. Finally, we perform clustering and metric learning simultaneously. Experimental results on a number of real-world data sets validate the effectiveness of the proposed method.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号