首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Co-clustering treats a data matrix in a symmetric fashion that a partitioning of rows can induce a partitioning of columns, and vice versa. It has been shown advantageous over tradition clustering. However, the computational complexity of most co-clustering algorithms are costly, and thus limit their e?ectiveness on large datasets. A recently proposed sampling-based matrix decomposition method can achieve a linear computational complexity, but selected rows and columns can not effectively represent a large sparse dataset, and many unselected rows and columns can not be mapped to the selected rows and columns because they do not share features in common, thus its performance is impaired. To address this problem, we propose a fast co-clustering framework by ranking and sampling that only representative samples are selected for co-clustering, and the remaining samples can be easily labeled by their neighbors in clustered samples. Extensive experiments on large text datasets show that our approach is able to use very few samples to achieve comparable results in linear time compared to state-of-the-art co-clustering algorithms of nonlinear computational complexity.  相似文献   

2.
协同聚类是对数据矩阵的行和列两个方向同时进行聚类的一类算法。本文将双层加权的思想引入协同聚类,提出了一种双层子空间加权协同聚类算法(TLWCC)。TLWCC对聚类块(co-cluster)加一层权重,对行和列再加一层权重,并且算法在迭代过程中自动计算块、行和列这三组权重。TLWCC考虑不同的块、行和列与相应块、行和列中心的距离,距离越大,认为其噪声越强,就给予小权重;反之噪声越弱,给予大权重。通过给噪声信息小权重,TLWCC能有效地降低噪声信息带来的干扰,提高聚类效果。本文通过四组实验展示TLWCC算法识别噪声信息的能力、参数选取对算法聚类结果的影响程度,算法的聚类性能和时间性能。  相似文献   

3.
准确而积极地向用户提供他们可能感兴趣的信息或服务是推荐系统的主要任务。协同过滤是采用得最广泛的推荐算法之一,而数据稀疏的问题往往严重影响推荐质量。为了解决这个问题,提出了基于二分图划分联合聚类的协同过滤推荐算法。首先将用户与项目构建成二分图进行联合聚类,从而映射到低维潜在特征空间;其次根据聚类结果改进2种相似性计算策略:簇偏好相似性和评分相似性,并将二者相结合。基于结合的相似性,分别采用基于用户和项目的方法来获得对未知目标评分的预测。最后,将这些预测结果进行融合。实验结果表明,所提算法比最新的联合聚类协同过滤推荐算法具有更好的性能。  相似文献   

4.
Many of the real world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph theoretical framework for addressing starstructured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star-structure. Although, graph partitioning approach has been adopted before to address star-structured heterogeneous complex problems, the main contribution of this work lies in an e cient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very quick as it requires a simple solution to a sparse system of overdetermined linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, e ciency and stability of the proposed algorithm.  相似文献   

5.
Clustering ensemble is a popular approach for identifying data clusters that combines the clustering results from multiple base clustering algorithms to produce more accurate and robust data clusters. However, the performance of clustering ensemble algorithms is highly dependent on the quality of clustering members. To address this problem, this paper proposes a member enhancement-based clustering ensemble (MECE) algorithm that selects the ensemble members by considering their distribution consistency. MECE has two main components, called heterocluster splitting and homocluster merging. The first component estimates two probability density functions (p.d.f.s) estimated on the sample points of an heterocluster and represents them using a Gaussian distribution and a Gaussian mixture model. If the random numbers generated by these two p.d.f.s have different probability distributions, the heterocluster is then split into smaller clusters. The second component merges the clusters that have high neighborhood densities into a homocluster, where the neighborhood density is measured using a novel evaluation criterion. In addition, a co-association matrix is presented, which serves as a summary for the ensemble of diverse clusters. A series of experiments were conducted to evaluate the feasibility and effectiveness of the proposed ensemble member generation algorithm. Results show that the proposed MECE algorithm can select high quality ensemble members and as a result yield the better clusterings than six state-of-the-art ensemble clustering algorithms, that is, cluster-based similarity partitioning algorithm (CSPA), meta-clustering algorithm (MCLA), hybrid bipartite graph formulation (HBGF), evidence accumulation clustering (EAC), locally weighted evidence accumulation (LWEA), and locally weighted graph partition (LWGP). Specifically, MECE algorithm has the nearly 23% higher average NMI, 27% higher average ARI, 15% higher average FMI, and 10% higher average purity than CSPA, MCLA, HBGF, EAC, LWEA, and LWGA algorithms. The experimental results demonstrate that MECE algorithm is a valid approach to deal with the clustering ensemble problems.  相似文献   

6.
贾洪杰  丁世飞  史忠植 《软件学报》2015,26(11):2836-2846
谱聚类将聚类问题转化成图划分问题,是一种基于代数图论的聚类方法.在求解图划分目标函数时,一般利用Rayleigh熵的性质,通过计算Laplacian矩阵的特征向量将原始数据点映射到一个低维的特征空间中,再进行聚类.然而在谱聚类过程中,存储相似矩阵的空间复杂度是O(n2),对Laplacian矩阵特征分解的时间复杂度一般为O(n3),这样的复杂度在处理大规模数据时是无法接受的.理论证明,Normalized Cut图聚类与加权核k-means都等价于矩阵迹的最大化问题.因此,可以用加权核k-means算法来优化Normalized Cut的目标函数,这就避免了对Laplacian矩阵特征分解.不过,加权核k-means算法需要计算核矩阵,其空间复杂度依然是O(n2).为了应对这一挑战,提出近似加权核k-means算法,仅使用核矩阵的一部分来求解大数据的谱聚类问题.理论分析和实验对比表明,近似加权核k-means的聚类表现与加权核k-means算法是相似的,但是极大地减小了时间和空间复杂性.  相似文献   

7.
徐森  皋军  徐秀芳  花小朋  徐静  安晶 《控制与决策》2018,33(12):2208-2212
将二部图模型引入聚类集成问题中,使用二部图模型同时建模对象集和超边集,充分挖掘潜藏在对象之间的相似度信息和超边提供的属性信息.设计正则化谱聚类算法解决二部图划分问题,在低维嵌入空间运行K-means++算法划分对象集,获得最终的聚类结果.在多组基准数据集上进行实验,实验结果表明所提出方法不仅能获得优越的结果,而且具有较高的运行效率.  相似文献   

8.
Co-clustering methods are valuable parsimonious approaches for the analysis of a binary data table by a simultaneous partitioning of the rows or the columns. Bringing the property of visualization to co-clustering is of first importance for a fast access to the essential topics and their relations. We propose a new generative self-organizing map by a particular parameterization of the Bernoulli block mixture model. The method is called block GTM or topographic block model. Thanks to the underlying probabilistic framework, the inference of the parameters of the method is performed with the block EM algorithm. At the maximization step, two local quadratic approximations of the objective function arise from a second-order optimization, respectively, with the Newton–Raphson algorithm and with a variational bound of the sigmoid function. In the experiments with several datasets, the two algorithms are able to outperform former approaches and lead to similar results when the parameters are regularized with a \(L_1\) -norm. The conclusion summarizes the contribution and some perspectives.  相似文献   

9.
Multi-way partitioning of an undirected weighted graph where pairwise similarities are assigned as edge weights, provides an important tool for data clustering, but is an NP-hard problem. Spectral relaxation is a popular way of relaxation, leading to spectral clustering where the clustering is performed by the eigen-decomposition of the (normalized) graph Laplacian. On the other hand, semidefinite relaxation, is an alternative way of relaxing a combinatorial optimization, leading to a convex optimization. In this paper we employ a semidefinite programming (SDP) approach to the graph equipartitioning for clustering, where sufficient conditions for strong duality hold. The method is referred to as semidefinite spectral clustering, where the clustering is based on the eigen-decomposition of the optimal feasible matrix computed by SDP. Numerical experiments with several data sets, demonstrate the useful behavior of our semidefinite spectral clustering, compared to existing spectral clustering methods.  相似文献   

10.
The row-wise multiple comparison procedure proposed in Hirotsu [Hirotsu, C., 1977. Multiple comparisons and clustering rows in a contingency table. Quality 7, 27–33 (in Japanese); Hirotsu, C., 1983. Defining the pattern of association in two-way contingency tables. Biometrika 70, 579–589] has been verified to be useful for clustering rows and/or columns of a contingency table in several applications. Although the method improved the preceding work there was still a gap between the squared distance between the two clusters of rows and the largest root of a Wishart matrix as a reference statistic for evaluating the significance of the clustering. In this paper we extend the squared distance to a generalized squared distance among any number of rows or clusters of rows and dissolves the loss of power in the process of the clustering procedure. If there is a natural ordering in columns we define an order sensitive squared distance and then the reference distribution becomes that of the largest root of a non-orthogonal Wishart matrix, which is very difficult to handle. We therefore propose a very nice χ2-approximation which improves the usual normal approximation in Anderson [Anderson, T.W., 2003. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley Intersciences, New York] and also the first χ2-approximation introduced in Hirotsu [Hirotsu, C., 1991. An approach to comparing treatments based on repeated measures. Biometrika 75, 583–594]. A two-way table reported by Guttman [Guttman, L., 1971. Measurement as structural theory. Psychometrika 36, 329–347] and analyzed by Greenacre [Greenacre, M.J., 1988. Clustering the rows and columns of a contingency table. Journal of Classification 5, 39–51] is reanalyzed and a very nice interpretation of the data has been obtained.  相似文献   

11.
The availability of data represented with multiple features coming from heterogeneous domains is getting more and more common in real world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for all the features domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less. This means that it does not require either the number of row clusters or the number of column clusters for the given feature spaces. Our approach optimizes the Goodman–Kruskal’s τ, a measure for cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and in particular we apply it in a higher dimensional setting. We propose the algorithm CoStar which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while it remains computationally efficient.  相似文献   

12.
Co-clustering with augmented matrix   总被引:1,自引:1,他引:0  
  相似文献   

13.
A bipartite graph is a powerful abstraction for modeling relationships between two collections. Visualizations of bipartite graphs allow users to understand the mutual relationships between the elements in the two collections, e.g., by identifying clusters of similarly connected elements. However, commonly‐used visual representations do not scale for the analysis of large bipartite graphs containing tens of millions of vertices, often resorting to an a‐priori clustering of the sets. To address this issue, we present the Who's‐Active‐On‐What‐Visualization (WAOW‐Vis) that allows for multiscale exploration of a bipartite social‐network without imposing an a‐priori clustering. To this end, we propose to treat a bipartite graph as a high‐dimensional space and we create the WAOW‐Vis adapting the multiscale dimensionality‐reduction technique HSNE. The application of HSNE for bipartite graph requires several modifications that form the contributions of this work. Given the nature of the problem, a set‐based similarity is proposed. For efficient and scalable computations, we use compressed bitmaps to represent sets and we present a novel space partitioning tree to efficiently compute similarities; the Sets Intersection Tree. Finally, we validate WAOW‐Vis on several datasets connecting Twitter‐users and ‐streams in different domains: news, computer science and politics. We show how WAOW‐Vis is particularly effective in identifying hierarchies of communities among social‐media users.  相似文献   

14.
In traditional co-clustering, the only basis for the clustering task is a given relationship matrix, describing the strengths of the relationships between pairs of elements in the different domains. Relying on this single input matrix, co-clustering discovers relationships holding among groups of elements from the two input domains. In many real life applications, on the other hand, other background knowledge or metadata about one or more of the two input domain dimensions may be available and, if leveraged properly, such metadata might play a significant role in the effectiveness of the co-clustering process. How additional metadata affects co-clustering, however, depends on how the process is modified to be context-aware. In this paper, we propose, compare, and evaluate three alternative strategies (metadata-driven, metadata-constrained, and metadata-injected co-clustering) for embedding available contextual knowledge into the co-clustering process. Experimental results show that it is possible to leverage the available metadata in discovering contextually-relevant co-clusters, without significant overheads in terms of information theoretical co-cluster quality or execution cost.  相似文献   

15.
Auditory scenes are temporal audio segments with coherent semantic content. Automatically classifying and grouping auditory scenes with similar semantics into categories is beneficial for many multimedia applications, such as semantic event detection and indexing. For such semantic categorization, auditory scenes are first characterized with either low-level acoustic features or some mid-level representations like audio effects, and then supervised classifiers or unsupervised clustering algorithms are employed to group scene segments into various semantic categories. In this paper, we focus on the problem of automatically categorizing audio scenes in unsupervised manner. To achieve more reasonable clustering results, we introduce the co-clustering scheme to exploit potential grouping trends among different dimensions of feature spaces (either low-level or mid-level feature spaces), and provide more accurate similarity measure for comparing auditory scenes. Moreover, we also extend the co-clustering scheme with a strategy based on the Bayesian information criterion (BIC) to automatically estimate the numbers of clusters. Evaluation performed on 272 auditory scenes extracted from 12-h audio data shows very encouraging categorization results. Co-clustering achieved a better performance compared to some traditional one-way clustering algorithms, both based on the low-level acoustic features and on the mid-level audio effect representations. Finally, we present our vision regarding the applicability of this approach on general multimedia data, and also show some preliminary results on content-based image clustering.  相似文献   

16.
A.  U.  M.  K. 《Future Generation Computer Systems》2005,21(8):1275-1284
For the solution of sparse linear systems from circuit simulation whose coefficient matrices include a few dense rows and columns, a parallel iterative algorithm with distributed Schur complement preconditioning is presented. The parallel efficiency of the solver is increased by transforming the equation system into a problem without dense rows and columns as well as by exploitation of parallel graph partitioning methods. The costs of local, incomplete LU decompositions are decreased by fill-in reducing reordering methods of the matrix and a threshold strategy for the factorization. The efficiency of the parallel solver is demonstrated with real circuit simulation problems on PC clusters.  相似文献   

17.
We say, that a subset K of the columns of a matrix is called a key, if every two rows that agree in the columns of K agree also in all columns of the matrix. A matrix represents a Sperner system K, if the system of minimal keys is exactly K. It is known, that such a representation always exists. In this paper we show, that the maximum of the minimum number of rows, that are needed to represent a Sperner system of only two element sets is 3(n/3+o(n)). We consider this problem for other classes of Sperner systems (e.g., for the class of trees, i.e. each minimal key has cardinality two, and the keys form a tree), too. The concept of keys plays an important role in database theory.  相似文献   

18.
目的 现有的图匹配算法大多应用于二维图像,对三维图像的特征点匹配存在匹配准确率低和计算速度慢等问题。为解决这些问题,本文将分解图匹配算法扩展应用在了三维图像上。方法 首先将需要匹配的两个三维图像的特征点作为图的节点集;再通过Delaunay三角剖分算法,将三维特征点相连,则相连得到的边就作为图的边集,从而建立有向图;然后,根据三维图像的特征点构建相应的三维有向图及其邻接矩阵;再根据有向图中的节点特征和边特征分别构建节点特征相似矩阵和边特征相似矩阵;最后根据这两个特征矩阵将节点匹配问题转化为求极值问题并求解。结果 实验表明,在手工选取特征点的情况下,本文算法对相同三维图像的特征点匹配有97.56%的平均准确率;对不同三维图像特征点匹配有76.39%的平均准确率;在三维图像有旋转的情况下,有90%以上的平均准确率;在特征点部分缺失的情况下,平均匹配准确率也能达到80%。在通过三维尺度不变特征变换(SIFT)算法得到特征点的情况下,本文算法对9个三维模型的特征点的平均匹配准确率为98.78%。结论 本文提出的基于图论的三维图像特征点匹配算法,经实验结果验证,可以取得较好的匹配效果。  相似文献   

19.
刘怡俊  龙锦涛  杨晓君 《计算机应用研究》2023,40(4):1246-1249+1274
针对传统模糊聚类算法对初始聚类中心非常敏感以及对高光谱图像处理效果不佳的问题,为减少聚类数据的复杂度、降低聚类过程的计算成本以提升聚类性能,提出了一种基于多层二部图的高光谱模糊聚类算法。首先使用SuperPCA预处理方法对超像素分割得到的每个同质区域进行PCA来学习HSI数据不同区域的固有低维特征,从而获得高光谱数据的低维表示;其次,构造一个多层二部图矩阵来描述数据点和锚点之间的关系,降低了计算复杂度;最后,在模糊聚类中加入基于多层二部图的非负正则项来约束模糊隶属度矩阵的解空间。在Indian Pines和Pavia University数据集上进行的实验表明,所提算法能提高聚类效果与性能。  相似文献   

20.
While spectral clustering can produce high-quality clusterings on small data sets, computational cost makes it infeasible for large data sets. Affinity Propagation (AP) has a limitation that it is hard to determine the value of parameter ‘preference’ which can lead to an optimal clustering solution. These problems limit the scope of application of the two methods. In this paper, we develop a novel fast two-stage spectral clustering framework with local and global consistency. Under this framework, we propose a Fast density-Weighted low-rank Approximation Spectral Clustering (FWASC) algorithm to address the above issues. The proposed algorithm is a high-quality graph partitioning method, and simultaneously considers both the local and global structure information contained in the data sets. Specifically, we first present a new Fast Two-Stage AP (FTSAP) algorithm to coarsen the input sparse graph and produce a small number of final representative exemplars, which is a simple and efficient sampling scheme. Then we present a density-weighted low-rank approximation spectral clustering algorithm to operate those representative exemplars on the global underlying structure of data manifold. Experimental results show that our algorithm outperforms the state-of-the-art spectral clustering and original AP algorithms in terms of speed, memory usage, and quality.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号