Similar Documents
20 similar documents found
1.
In outlier detection, the data can be viewed as a heavy mixture of normal and abnormal points in space; subject to keeping the loss of normal points low, the outliers are usually contained in the set of samples farthest from the cluster centers. Motivated by this idea, we propose an interpolation-based outlier detection method for high-dimensional sparse data. The method applies a genetic algorithm on top of K-means to interpolate the original data, resolving the tendency of K-means clustering to merge sparse data. Experimental results show that, compared with outlier detection based on conventional K-means and several typical detection methods based on improved K-means, the proposed method loses fewer normal points and improves detection accuracy and precision.
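As a rough illustration of the K-means baseline this abstract builds on, the sketch below flags the points farthest from their assigned cluster centers as outlier candidates; the paper's genetic-algorithm interpolation step is not reproduced, and the function name, cluster count, and contamination threshold are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outliers(X, n_clusters=5, contamination=0.05):
    """Flag the points farthest from their K-means centers as outliers."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    # Distance of each point to the center of its own cluster.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    threshold = np.quantile(dists, 1.0 - contamination)
    return dists > threshold  # boolean mask of outlier candidates

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 50)), rng.normal(6, 1, (5, 50))])
print(kmeans_outliers(X).sum(), "points flagged")
```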

2.
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum-likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model; in fact, it can be considered a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality, well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
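As a compact sketch of the spkmeans procedure the abstract describes (inputs and centers kept on the unit sphere, assignment by cosine similarity), under the assumption of random initialization; the frequency-sensitive penalty that balances cluster sizes is deliberately omitted:

```python
import numpy as np

def spkmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means: unit-length inputs and centers, cosine assignment."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize inputs
    C = X[rng.choice(len(X), k, replace=False)]        # initial centers
    for _ in range(n_iter):
        labels = np.argmax(X @ C.T, axis=1)            # cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)           # re-normalize center
    return labels, C
```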

3.
Matrix computations are both fundamental and ubiquitous in computational science, and as a result they are used across numerous disciplines of scientific computing and engineering. Because the high computational complexity of matrix operations makes them critical to the performance of a large number of applications, their efficient execution in distributed environments becomes a crucial issue. This work proposes a novel approach for distributing sparse matrix arithmetic operations on computer clusters, aiming to speed up the processing of high-dimensional matrices. The approach focuses on how to split such operations into independent parallel tasks by considering the intrinsic characteristics that distinguish each type of operation and the particular matrices involved. The approach was applied to the most commonly used arithmetic operations between matrices. Its performance was evaluated on a high-dimensional text feature selection task and two real-world datasets. Experimental evaluation showed that the proposed approach significantly reduced the computing times of large-scale matrix operations when compared to serial and multi-threaded implementations as well as several linear algebra software libraries.
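As a single-machine sketch of the task-splitting idea (not the authors' system): a sparse product A @ B decomposes into independent tasks, one per block of rows of A; on a cluster each task would be shipped to a worker, while here a local process pool stands in for the workers. Block size and matrix shapes are arbitrary assumptions.

```python
import numpy as np
import scipy.sparse as sp
from concurrent.futures import ProcessPoolExecutor

A = sp.random(10_000, 10_000, density=1e-3, format="csr", random_state=0)
B = sp.random(10_000, 100, density=1e-2, format="csr", random_state=1)

def row_block_product(bounds):
    lo, hi = bounds
    return A[lo:hi] @ B          # each row block multiplies independently

if __name__ == "__main__":
    blocks = [(i, min(i + 2500, A.shape[0])) for i in range(0, A.shape[0], 2500)]
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(row_block_product, blocks))
    C = sp.vstack(parts)          # reassemble the distributed result
    assert abs(C - A @ B).sum() < 1e-9
```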

4.
A hyperplane based indexing technique for high-dimensional data
In this paper, we propose a novel hyperplane-based indexing method to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data-partitioning efficiency in a high-dimensional space by using a hyperplane, which further partitions a subspace and can also take advantage of the twin-node concept used in the key-dimension-based index. Compared with the key-dimension concept, the hyperplane is more effective in data filtering. High space utilization is achieved by dynamically reallocating data between twin nodes. In addition, a post-processing step is used after index building to ensure effective filtering. Extensive experiments on two types of real data sets show significantly improved filtering efficiency. Owing to the nature of hyperplanes, the proposed indexing method is suitable only for Euclidean spaces.
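As a toy sketch of the core partitioning step only: a hyperplane with normal w and offset b splits a node's points by the sign of w·x − b. How the paper actually derives its hyperplane, and its twin-node reallocation, are not modeled; the choice of normal below is our own placeholder heuristic.

```python
import numpy as np

def split_by_hyperplane(points):
    """Split points into two children by a median hyperplane."""
    # Placeholder normal: direction of largest per-axis spread (an assumption;
    # the paper derives its hyperplane differently).
    w = points.std(axis=0)
    w = w / np.linalg.norm(w)
    proj = points @ w
    b = np.median(proj)                 # median offset gives a balanced split
    return (w, b), points[proj <= b], points[proj > b]
```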

5.
Processing high-dimensional data within subspaces of the high-dimensional data space is an effective way to reduce, or even eliminate, the "curse of dimensionality". To select reasonable subspaces, a subspace generation method based on grid partitioning is proposed. Taking the overall distribution of the data set into account, each dimension is partitioned into equal-depth intervals, laying a solid foundation for subsequent processing of the high-dimensional data.
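As a short sketch of the equal-depth partitioning step the abstract describes: each dimension is cut at its quantiles, so interval boundaries follow the overall distribution of the data set. Function and parameter names are our own.

```python
import numpy as np

def equal_depth_bins(X, n_bins=4):
    """Per dimension: quantile cut points and the bin index of each point."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(X, qs, axis=0)          # shape (n_bins - 1, n_dims)
    # searchsorted per column assigns each point a bin along that dimension.
    bins = np.stack([np.searchsorted(edges[:, d], X[:, d])
                     for d in range(X.shape[1])], axis=1)
    return edges, bins
```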

6.
7.
Three-Dimensional (3D) Active Shape Modeling (ASM) is a straightforward extension of 2D ASM. 3D ASM is robust when true volumetric data is considered; however, when the information in one dimension is sparse, pure 3D ASM tends to be less robust. We present a hybrid 2D + 3D methodology which can deal with sparse 3D data. 2D and 3D ASMs are combined to obtain a "globally optimal" segmentation of the 3D object embedded in the data set, rather than the "locally optimal" segmentation on separate slices. Experimental results indicate that the developed approach shows equivalent precision on separate slices but higher consistency for whole volumes when compared to 2D ASM, while the results for whole volumes are improved when compared to the pure 3D ASM approach.

8.
9.
Parallel Computing, 1988, 7(2): 199–210
We develop an algorithm for computing the symbolic Cholesky factorization of a large sparse symmetric positive definite matrix. The algorithm is intended for a message-passing multiprocessor system, such as the hypercube, and is based on the concept of elimination forest. In addition, we provide an algorithm for computing these forests along with a discussion of the algorithm's complexity and a proof of its correctness.
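For context, the serial version of the elimination-forest computation is short: parent[j] is the first off-diagonal nonzero row in column j of the Cholesky factor, computable by a standard path-compression pass (Liu's algorithm) over the symmetric input. The paper's message-passing distribution of this step is not shown here.

```python
import numpy as np
import scipy.sparse as sp

def elimination_forest(A):
    """Elimination forest of a sparse symmetric matrix (serial sketch)."""
    A = sp.csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)              # path-compression workspace
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
            # Follow each nonzero row i < j up to its current root.
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j            # compress the path toward j
                if nxt == -1:
                    parent[i] = j          # j becomes i's parent
                i = nxt
    return parent                          # roots carry -1 (a forest)
```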

10.
The emergence of cloud datacenters enhances the capability of online data storage. Since massive data is stored in datacenters, it is necessary to effectively locate and access data of interest in such a distributed system. However, traditional search techniques only allow users to search images over exact-match keywords through a centralized index. These techniques cannot satisfy the requirements of content-based image retrieval (CBIR). In this paper, we propose a scalable image retrieval framework which can efficiently support content similarity search and semantic search in the distributed environment. Its key idea is to integrate image feature vectors into distributed hash tables (DHTs) by exploiting the property of locality sensitive hashing (LSH). Thus, images with similar content are most likely gathered into the same node without the knowledge of any global information. For searching semantically close images, relevance feedback is adopted in our system to bridge the gap between low-level and high-level features. We show that our approach yields a high recall rate with good load balance and requires only a small number of hops.
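As a minimal sketch of the placement idea, assuming standard random-hyperplane (cosine) LSH: the sign pattern of a few random projections gives a bucket id, which would serve as the DHT key so that similar feature vectors tend to collide on the same node. DHT routing and relevance feedback are not modeled, and all names here are ours.

```python
import numpy as np

class LSHPlacer:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))   # random hyperplanes

    def node_key(self, feature):
        """Bucket id = sign pattern of the projections, read as an integer."""
        bits = (self.planes @ feature) > 0
        return int("".join("1" if b else "0" for b in bits), 2)

placer = LSHPlacer(dim=128)
v = np.random.default_rng(1).normal(size=128)
print(placer.node_key(v))   # nearby vectors collide with high probability
```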

11.
In this paper we present a new method for Joint Feature Selection and Classifier Learning using a sparse Bayesian approach. These tasks are performed by optimizing a global loss function that includes a term associated with the empirical loss and another one representing a feature selection and regularization constraint on the parameters. To minimize this function we use a recently proposed technique, the Boosted Lasso algorithm, which follows the regularization path of the empirical risk associated with our loss function. We develop the algorithm for a well-known nonparametric classification method, the relevance vector machine, and perform experiments using a synthetic data set and three databases from the UCI Machine Learning Repository. The results show that our method is able to select the relevant features, in some cases increasing the classification accuracy when feature selection is performed.

12.
The discovery of structures hidden in high-dimensional data space is of great significance for understanding and further processing of the data. Real-world datasets are often composed of multiple low-dimensional patterns, whose interlacement may impede our ability to understand the distribution rule of the data. Few of the existing methods focus on the detection and extraction of the manifolds representing distinct patterns. Inspired by the nonlinear dimensionality reduction method ISOmap, in this paper we present a novel approach called Multi-Manifold Partition to identify the interlacing low-dimensional patterns. The algorithm has three steps: first, a neighborhood graph is built to capture the intrinsic topological structure of the input data; then, the dimensional uniformity of neighboring nodes is analyzed to discover the segments of patterns; finally, the segments which possibly come from the same low-dimensional structure are combined to obtain a global representation of the distribution rules. Experiments on synthetic data as well as real problems are reported. The results show that this new approach to exploratory data analysis is effective and may enhance our understanding of the data distribution.
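The first of the three steps is standard enough to show concretely; a sketch of the neighborhood-graph construction, using scikit-learn's k-NN graph as a stand-in (the dimensional-uniformity analysis and segment merging are paper-specific and not reproduced):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(0).normal(size=(500, 20))   # placeholder data
# Sparse adjacency matrix; edge weights are Euclidean distances.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")
print(G.shape, G.nnz)
```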

13.
We propose a new data-mining method that is effective for learning from extremely high-dimensional data sets. Our proposed method selects a subset of features from a high-dimensional data set by a process of iterative refinement. The selection of a feature subset has two steps: the first selects a subset of instances, on which predictions by previously obtained hypotheses are most unreliable, from the data set; the second selects a subset of features whose values in the selected instances vary the most from those in all instances of the database. We empirically evaluate the effectiveness of the proposed method by comparing its performance with those of four other methods, including one of the latest feature-subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the performance of the proposed method exceeds those of the other methods in terms of prediction accuracy, precision at a certain recall value, and computation time to reach a certain prediction accuracy. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels.
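A rough sketch of one refinement round as the abstract outlines it: (1) pick the instances on which the current hypothesis is least confident, then (2) pick the features whose values on those instances deviate most from their values over all instances. The concrete scoring rules below are our assumptions, not the paper's.

```python
import numpy as np

def refine_features(X, proba, n_instances=100, n_features=50):
    """One round: unreliable instances -> most deviating features."""
    # (1) Least reliable = predicted probabilities closest to 0.5.
    unreliable = np.argsort(np.abs(proba - 0.5))[:n_instances]
    # (2) Features whose mean on that subset shifts most from the global
    # mean, measured in units of the global standard deviation.
    shift = np.abs(X[unreliable].mean(axis=0) - X.mean(axis=0))
    score = shift / (X.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1][:n_features]
```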

14.
Sparse representation is a mathematical model for data representation that has proved to be a powerful tool for solving problems in various fields such as pattern recognition, machine learning, and computer vision. As one of the building blocks of the sparse representation method, dictionary learning plays an important role in minimizing the reconstruction error between the original signal and its sparse representation in the space of the learned dictionary. Although using training samples directly as dictionary bases can achieve good performance, the main drawback of this method is that it may result in a very large and inefficient dictionary due to noisy training instances. To obtain a smaller and more representative dictionary, in this paper we propose an approach called Laplacian sparse dictionary (LSD) learning. Our method is based on manifold learning and double sparsity. We incorporate a Laplacian weighted graph in the sparse representation model and impose l1-norm sparsity on the dictionary. An LSD is a sparse overcomplete dictionary that can preserve the intrinsic structure of the data and learn a smaller dictionary for each class. The learned LSD can be easily integrated into a classification framework based on sparse representation. We compare the proposed method with other methods on three controlled benchmark face image databases (Extended Yale B, ORL, and AR) and one uncontrolled person image dataset (i-LIDS-MA). The results show the advantages of the proposed LSD algorithm over state-of-the-art sparse-representation-based classification methods.
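One plausible way to write the LSD objective the abstract describes (our reconstruction, not the paper's exact formulation): a reconstruction term, l1-sparse codes, a graph-Laplacian term that ties the codes of neighboring samples together, and an l1 penalty that makes the dictionary itself sparse.

```latex
% X: data, D: dictionary, A: sparse codes, L: graph Laplacian of the
% Laplacian weighted graph; lambda, beta, gamma are trade-off weights.
\min_{D,\,A}\; \|X - DA\|_F^2
  + \lambda \|A\|_1
  + \beta\,\operatorname{tr}\!\left(A L A^{\top}\right)
  + \gamma \|D\|_1
```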

15.
In this paper we describe general software utilities for performing unstructured sparse matrix–vector multiplications on distributed-memory message-passing computers. The matrix–vector multiply is an important kernel in the solution of large sparse linear systems by iterative methods. Our focus is to present the data structures and communication parameters required by these utilities for general sparse unstructured matrices with data locality. These types of matrices are commonly produced by finite difference and finite element approximations to systems of partial differential equations. We also present representative examples and timings which demonstrate the utility and performance of the software.
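For reference, the local kernel each process would run is a plain CSR matrix–vector product over the rows it owns; a sketch under the assumption of NumPy arrays for the CSR fields, with the communication of off-process entries of x omitted.

```python
import numpy as np

def csr_matvec(indptr, indices, data, x, row_lo, row_hi):
    """y = A[row_lo:row_hi] @ x for a CSR matrix (indptr, indices, data)."""
    y = np.zeros(row_hi - row_lo)
    for r in range(row_lo, row_hi):
        lo, hi = indptr[r], indptr[r + 1]
        y[r - row_lo] = data[lo:hi] @ x[indices[lo:hi]]
    return y
```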

16.
This paper proposes a novel cross-correlation neural network (CNN) model for finding the principal singular subspace of a cross-correlation matrix between two high-dimensional data streams. We introduce a novel nonquadratic criterion (NQC) for searching for the optimum weights of two linear neural networks (LNNs). The NQC exhibits a single global minimum, attained if and only if the weight matrices of the left and right neural networks span the left and right principal singular subspaces of the cross-correlation matrix, respectively; the other stationary points of the NQC are (unstable) saddle points. We develop an adaptive algorithm based on the NQC for tracking the principal singular subspace of a cross-correlation matrix between two high-dimensional vector sequences. The NQC algorithm provides fast online learning of the optimum weights for the two LNNs, and its global asymptotic stability is analyzed. The NQC algorithm has several key advantages, such as faster convergence, which is illustrated through simulations.

17.
A novel random-gradient-based algorithm is developed for online tracking of the minor component (MC) associated with the smallest eigenvalue of the autocorrelation matrix of the input vector sequence. The five available learning algorithms for tracking one MC are extended to algorithms for tracking multiple MCs or the minor subspace (MS). To overcome the dynamical divergence properties of some available random-gradient-based algorithms, we propose a modification of the Oja-type algorithms, called OJAm, which works satisfactorily. The averaging differential equation and the energy function associated with OJAm are given. It is shown that the averaging differential equation globally asymptotically converges to an invariant set. The corresponding energy or Lyapunov functions exhibit a unique global minimum, attained if and only if the state matrices span the MS of the autocorrelation matrix of the vector data stream; the other stationary points are (unstable) saddle points. The global convergence of OJAm is also studied. OJAm provides efficient online learning for tracking the MS and can track an orthonormal basis of the MS, while the other five available algorithms cannot track any orthonormal basis of the MS. The performance of the algorithms is compared via computer simulations.
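As a generic illustration only (not the paper's OJAm update, whose exact form is not reproduced here): a minimal online minor-component tracker with an anti-Hebbian step and explicit renormalization, the latter sidestepping in a cruder way the divergence issue the paper addresses.

```python
import numpy as np

def track_minor_component(stream, dim, eta=0.01, seed=0):
    """Online estimate of the minor component of the autocorrelation matrix."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for x in stream:
        w -= eta * (w @ x) * x          # anti-Hebbian step
        w /= np.linalg.norm(w)          # keep the iterate on the unit sphere
    return w

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 5)) * np.array([3, 2, 1.5, 1, 0.2])
print(np.round(track_minor_component(data, dim=5), 2))
# The result should align (up to sign) with the smallest-variance axis.
```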

18.
A class of linear classification rules, specifically designed for high-dimensional problems, is proposed. The new rules are based on Gaussian factor models and are able to successfully incorporate the information contained in the sample correlations. Asymptotic results, which allow the number of variables to grow faster than the number of observations, demonstrate that the worst possible expected error rate of the proposed rules converges to the error of the optimal Bayes rule when the postulated model is true, and to a slightly larger constant when this model is a reasonable approximation to the data-generating process. Numerical comparisons suggest that, when combined with appropriate variable selection strategies, rules derived from one-factor models perform comparably to, or better than, the most successful extant alternatives under the conditions they were designed for. The proposed methods are implemented as an R package named HiDimDA, available from the CRAN repository.

19.
Over the past few decades, biometric recognition has firmly established itself as one of the areas with tremendous potential for scientific discovery and for advancing state-of-the-art research in the security domain. Hardly a single area of IT is left untouched by increased vulnerabilities, online scams, e-fraud, illegal activities, and even pranks in virtual worlds. In parallel with biometric development, which moved from a focus on single biometric recognition methods to multi-modal information fusion, another rising area of research is virtual-world security and avatar recognition. This article explores links between multi-biometric system architecture and face recognition in virtual worlds, and proposes a methodology that can benefit both applications.

20.