Similar Documents
 20 similar documents found (search time: 15 ms)
1.
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality, well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
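To make the mechanism concrete, the sketch below implements plain spherical k-means in Python/NumPy (unit-normalized inputs and centroids, cosine-similarity assignment), with an optional `alpha` discount on already-crowded clusters as a stand-in for frequency sensitivity; `alpha` and its logarithmic form are illustrative assumptions, not the authors' exact variants.

```python
import numpy as np

def fs_spherical_kmeans(X, k, n_iter=20, alpha=0.0, seed=0):
    """Spherical k-means with an optional frequency-sensitive discount.

    alpha=0 reduces to plain spkmeans; alpha>0 discounts clusters that already
    hold many points (a placeholder for frequency-sensitive assignment, not
    the paper's exact rule)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)           # unit-normalize inputs
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        sims = X @ centers.T                                    # cosine similarity to each center
        counts = np.ones(k)
        labels = np.empty(len(X), dtype=int)
        for i in np.argsort(-sims.max(axis=1)):                 # assign most confident points first
            labels[i] = np.argmax(sims[i] - alpha * np.log(counts))
            counts[labels[i]] += 1
        for j in range(k):                                      # re-estimate unit-norm centers
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / (np.linalg.norm(m) + 1e-12)
    return labels, centers
```

Each iteration touches every point once per cluster, so the cost stays linear in both the number of points and the number of clusters, as the abstract states.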

2.
Matrix computations are both fundamental and ubiquitous in computational science, and as a result, they are frequently used in numerous disciplines of scientific computing and engineering. Because matrix operations are computationally expensive and critical to the performance of a large number of applications, their efficient execution in distributed environments is a crucial issue. This work proposes a novel approach for distributing sparse matrix arithmetic operations on computer clusters, aimed at speeding up the processing of high-dimensional matrices. The approach focuses on how to split such operations into independent parallel tasks by considering the intrinsic characteristics that distinguish each type of operation and the particular matrices involved. The approach was applied to the most commonly used arithmetic operations between matrices. The performance of the presented approach was evaluated using a high-dimensional text feature selection approach and two real-world datasets. Experimental evaluation showed that the proposed approach helped to significantly reduce the computing times of large-scale matrix operations when compared to serial and multi-threaded implementations as well as several linear algebra software libraries.
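The core idea — splitting one matrix operation into independent tasks keyed to the matrices involved — can be illustrated with a row-block split of a sparse matrix product. The sketch below uses SciPy/NumPy and local processes instead of a cluster, so it demonstrates the partitioning strategy rather than the authors' distributed implementation.

```python
import numpy as np
from multiprocessing import Pool
from scipy import sparse

def _row_block_product(args):
    """One independent task: multiply a horizontal slice of A by the full B."""
    a_block, b = args
    return a_block @ b

def distributed_spmm(a, b, n_tasks=4, n_workers=4):
    """Split the sparse product A @ B into row-block tasks and run them in
    parallel (illustrative sketch; a cluster version would scatter the blocks
    to remote workers instead of local processes)."""
    bounds = np.linspace(0, a.shape[0], n_tasks + 1, dtype=int)
    tasks = [(a[lo:hi], b) for lo, hi in zip(bounds[:-1], bounds[1:])]
    with Pool(n_workers) as pool:
        blocks = pool.map(_row_block_product, tasks)   # order-preserving map
    return sparse.vstack(blocks)

if __name__ == "__main__":
    A = sparse.random(2000, 5000, density=0.001, format="csr")
    B = sparse.random(5000, 300, density=0.001, format="csr")
    C = distributed_spmm(A, B)
    assert np.allclose((A @ B).toarray(), C.toarray())
```

Because each row block of A can be multiplied by B without any communication between tasks, the split is embarrassingly parallel; other operations (addition, transposition, element-wise products) admit analogous independent splits.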

3.
A hyperplane based indexing technique for high-dimensional data
In this paper, we propose a novel hyperplane based indexing method to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data partitioning efficiency in a high-dimensional space by using a hyperplane, which further partitions a subspace and can also take advantage of the twin node concept used in the key dimension based index. Compared with the key dimension concept, the hyperplane is more effective in data filtering. High space utilization is achieved by dynamically performing data reallocation between twin nodes. In addition, a post-processing step is used after index building to ensure effective filtering. Extensive experiments based on two types of real data sets are conducted, and the results illustrate significantly improved filtering efficiency. Because of the nature of the hyperplane, the proposed indexing method is only suitable for Euclidean spaces.
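The filtering role of the hyperplane can be illustrated with a small Euclidean-space sketch: a node's points are split by a hyperplane, and at query time a whole half-space can be discarded whenever the query lies farther from the plane than the search radius. The centroid/leading-direction choice of hyperplane here is an assumption for illustration, not the paper's construction.

```python
import numpy as np

def split_by_hyperplane(points):
    """Split a node's points with a hyperplane through their centroid, oriented
    along the direction of maximum spread (an illustrative choice of hyperplane)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[0]                                   # leading right singular vector
    offsets = (points - centroid) @ normal
    plane = (normal, centroid)
    return plane, points[offsets <= 0], points[offsets > 0]

def can_prune(query, radius, plane, side):
    """Filtering step: a half-space cannot contain any point within `radius`
    of the query if the query lies farther than `radius` on the other side."""
    normal, centroid = plane
    d = (query - centroid) @ normal                  # signed distance (Euclidean space assumed)
    return d > radius if side == "neg" else d < -radius
```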

4.
Three-Dimensional (3D) Active Shape Modeling (ASM) is a straightforward extension of 2D ASM. 3D ASM is robust when true volumetric data is considered. However, when the information in one dimension is sparse, pure 3D ASM tends to be less robust. We present a hybrid 2D + 3D methodology which can deal with sparse 3D data. 2D and 3D ASMs are combined to obtain a “global optimal” segmentation of the 3D object embedded in the data set, rather than the “locally optimal” segmentation on separate slices. Experimental results indicate that the developed approach shows equivalent precision on separate slices but higher consistency for whole volumes when compared to 2D ASM, while the results for whole volumes are improved when compared to the pure 3D ASM approach. The text was submitted by the authors in English. Stuart Michael Williams, born in 1967, graduated with BAHons in 1989, BMBCh in 1992 from Oxford University, UK; MRCP (1995), FRCR (1999); Stuart Michael Williams is currently the Consultant Radiologist of Norfolk and Norwich University Hospital, Norwich, UK. His research areas include oncological radiology with an interest in image analysis and medical education. Stuart Michael Williams has 24 publications (monographs and articles). He is a member of the Royal College of Radiologists; member of the European Congress of Radiology; and a member of the European Society of Magnetic Resonance in Medicine and Biology. Yanong Zhu, born in 1975, graduated with B. Sci. in 1997 and M. Sci. in 2002 from Northwest University, China and PhD in 2006 from the University of East Anglia, Norwich, UK. His research areas include computer vision, medical image understanding, and analysis. Yanong Zhu has eight publications (monographs and articles). Reyer Zwiggelaar, born in 1963, graduated with B. Sci. from State University Groningen, the Netherlands in 1989. He was awarded his PhD in 1993 by University College London, UK. Reyer Zwiggelaar is currently the Senior Lecturer at the University of Wales Aberystwyth, UK. Dr. Zwiggelaar has more than 80 publications (monographs and articles). His research areas include medical image understanding, especially concentrating on mammographic data, pattern recognition, statistical methods, and feature detection techniques.

5.
6.
The emergence of cloud datacenters enhances the capability of online data storage. Since massive amounts of data are stored in datacenters, it is necessary to effectively locate and access data of interest in such a distributed system. However, traditional search techniques only allow users to search images over exact-match keywords through a centralized index. These techniques cannot satisfy the requirements of content-based image retrieval (CBIR). In this paper, we propose a scalable image retrieval framework which can efficiently support content similarity search and semantic search in the distributed environment. Its key idea is to integrate image feature vectors into distributed hash tables (DHTs) by exploiting the property of locality sensitive hashing (LSH). Thus, images with similar content are most likely gathered onto the same node without knowledge of any global information. For searching semantically close images, relevance feedback is adopted in our system to overcome the gap between low-level and high-level features. We show that our approach yields a high recall rate with good load balance and requires only a small number of hops.
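The routing idea can be sketched in a few lines: hash each feature vector with random-hyperplane LSH so that similar vectors share a bucket, then map the bucket into the DHT identifier space. The sketch below assumes a Chord-style SHA-1 key space and is only an illustration of the idea, not the paper's exact scheme.

```python
import hashlib
import numpy as np

class LSHRouter:
    """Map image feature vectors to DHT keys with random-hyperplane LSH, so
    vectors with similar content tend to hash to the same node."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))     # random hyperplanes

    def bucket(self, vec):
        bits = (self.planes @ vec > 0).astype(int)           # one sign bit per hyperplane
        return "".join(map(str, bits))

    def dht_key(self, vec):
        # Hash the LSH bucket into the DHT identifier space (SHA-1, Chord-like).
        return hashlib.sha1(self.bucket(vec).encode()).hexdigest()

router = LSHRouter(dim=128)
feature = np.random.rand(128)            # an image feature vector
print(router.dht_key(feature))           # the node responsible for this key stores the image
```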

7.
In this paper we present a new method for Joint Feature Selection and Classifier Learning using a sparse Bayesian approach. These tasks are performed by optimizing a global loss function that includes a term associated with the empirical loss and another one representing a feature selection and regularization constraint on the parameters. To minimize this function we use a recently proposed technique, the Boosted Lasso algorithm, which follows the regularization path of the empirical risk associated with our loss function. We develop the algorithm for a well-known nonparametric classification method, the relevance vector machine, and perform experiments using a synthetic data set and three databases from the UCI Machine Learning Repository. The results show that our method is able to select the relevant features, in some cases increasing the classification accuracy when feature selection is performed.

8.
We propose a new data-mining method that is effective for learning from extremely high-dimensional data sets. Our proposed method selects a subset of features from a high-dimensional data set by a process of iterative refinement. Our selection of a feature subset has two steps. The first step selects a subset of instances, for which predictions by previously obtained hypotheses are most unreliable, from the data set. The second step selects a subset of features whose values in the selected instances vary the most from those in all instances of the database. We empirically evaluate the effectiveness of the proposed method by comparing its performance with those of four other methods, including one of the latest feature-subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the performance of the proposed method exceeds those of the other methods in terms of prediction accuracy, precision at a certain recall value, and computation time to reach a certain prediction accuracy. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels. Extended abstracts of parts of the work presented in this paper have appeared in Mamitsuka [14] and Mamitsuka [15]. Hiroshi Mamitsuka is currently Associate Professor in the Institute for Chemical Research at Kyoto University. He received his B.S. in Biochemistry and Biophysics, M.E. in Information Engineering and Ph.D. in Information Sciences from the University of Tokyo in 1988, 1991 and 1999, respectively. He worked in NEC Research Laboratories in Japan from 1991 to 2002. His current research interests are in bioinformatics, computational molecular biology, chemical genomics, medicinal chemistry, machine learning and data mining. Hiroshi Mamitsuka, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan. E-mail: mami@kuicr.kyoto-u.ac.jp
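The two-step refinement can be illustrated with a minimal sketch. The reliability and deviation measures below (distance of a binary-class probability from 0.5, and the shift of the feature means on the selected instances) are illustrative assumptions, not the authors' exact criteria.

```python
import numpy as np

def refine_feature_subset(X, proba, n_instances, n_features):
    """One refinement round in the spirit of the two steps described above.

    Step 1: pick the instances whose current predictions are least reliable
            (binary-class probability closest to 0.5).
    Step 2: pick the features whose mean over those instances deviates most
            from the mean over all instances."""
    unreliability = -np.abs(proba - 0.5)                   # higher = less reliable
    hard = np.argsort(unreliability)[-n_instances:]
    deviation = np.abs(X[hard].mean(axis=0) - X.mean(axis=0))
    return np.argsort(deviation)[-n_features:]             # indices of the selected features
```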

9.
Sparse representation is a mathematical model for data representation that has proved to be a powerful tool for solving problems in various fields such as pattern recognition, machine learning, and computer vision. As one of the building blocks of the sparse representation method, dictionary learning plays an important role in the minimization of the reconstruction error between the original signal and its sparse representation in the space of the learned dictionary. Although using training samples directly as dictionary bases can achieve good performance, the main drawback of this method is that it may result in a very large and inefficient dictionary due to noisy training instances. To obtain a smaller and more representative dictionary, in this paper, we propose an approach called Laplacian sparse dictionary (LSD) learning. Our method is based on manifold learning and double sparsity. We incorporate the Laplacian weighted graph in the sparse representation model and impose l1-norm sparsity on the dictionary. An LSD is a sparse overcomplete dictionary that can preserve the intrinsic structure of the data and learn a smaller dictionary for each class. The learned LSD can be easily integrated into a classification framework based on sparse representation. We compare the proposed method with other methods using three controlled benchmark face image databases, Extended Yale B, ORL, and AR, and one uncontrolled person image dataset, i-LIDS-MA. Results show the advantages of the proposed LSD algorithm over state-of-the-art sparse representation based classification methods.
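Reading the abstract, one plausible schematic form of the LSD objective — reconstruction error, l1 sparsity on both the codes and the dictionary ("double sparsity"), and a Laplacian graph term — is the following; the symbols and the weights λ, β, γ are assumptions for illustration, not the paper's exact formulation:

\[
\min_{D,\,A}\ \|X - DA\|_F^2 \;+\; \lambda\|A\|_1 \;+\; \beta\|D\|_1 \;+\; \gamma\,\operatorname{tr}\!\left(A L A^{\top}\right),
\]

where X stacks the training samples, D is the overcomplete dictionary, A holds the sparse codes, and L is the Laplacian of the weighted neighborhood graph that preserves the intrinsic structure of the data.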

10.
The discovery of structures hidden in high-dimensional data space is of great significance for understanding and further processing of the data. Real world datasets are often composed of multiple low dimensional patterns, the interlacement of which may impede our ability to understand the distribution rule of the data. Few of the existing methods focus on the detection and extraction of the manifolds representing distinct patterns. Inspired by the nonlinear dimensionality reduction method ISOmap, in this paper we present a novel approach called Multi-Manifold Partition to identify the interlacing low dimensional patterns. The algorithm has three steps: first a neighborhood graph is built to capture the intrinsic topological structure of the input data, then the dimensional uniformity of neighboring nodes is analyzed to discover the segments of patterns, finally the segments which are possibly from the same low-dimensional structure are combined to obtain a global representation of distribution rules. Experiments on synthetic data as well as real problems are reported. The results show that this new approach to exploratory data analysis is effective and may enhance our understanding of the data distribution.
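As a concrete reading of the "dimensional uniformity" analysis in the second step, the sketch below estimates an intrinsic dimension at each point from local PCA over its nearest neighbors; this estimator, its parameters, and the use of scikit-learn are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_dimensions(X, k=10, var_ratio=0.95):
    """Estimate an intrinsic dimension at every point: the number of local
    principal directions needed to explain `var_ratio` of the variance of its
    k nearest neighbors."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    dims = np.empty(len(X), dtype=int)
    for i, neigh in enumerate(idx):
        patch = X[neigh] - X[neigh].mean(axis=0)              # centered local neighborhood
        s = np.linalg.svd(patch, compute_uv=False) ** 2       # local PCA spectrum
        dims[i] = np.searchsorted(np.cumsum(s) / s.sum(), var_ratio) + 1
    return dims
```

Points would then be grouped into segments of uniform local dimension along the neighborhood graph, and connected segments merged into global manifolds, mirroring the three steps listed in the abstract.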

11.
This paper proposes a novel cross-correlation neural network (CNN) model for finding the principal singular subspace of a cross-correlation matrix between two high-dimensional data streams. We introduce a novel nonquadratic criterion (NQC) for searching for the optimum weights of two linear neural networks (LNNs). The NQC exhibits a single global minimum, attained if and only if the weight matrices of the left and right neural networks span the left and right principal singular subspaces of a cross-correlation matrix, respectively. The other stationary points of the NQC are (unstable) saddle points. We develop an adaptive algorithm based on the NQC for tracking the principal singular subspace of a cross-correlation matrix between two high-dimensional vector sequences. The NQC algorithm provides fast online learning of the optimum weights of the two LNNs. The global asymptotic stability of the NQC algorithm is analyzed. The NQC algorithm has several key advantages, such as faster convergence, which is illustrated through simulations.

12.
A novel random-gradient-based algorithm is developed for online tracking of the minor component (MC) associated with the smallest eigenvalue of the autocorrelation matrix of the input vector sequence. The five available learning algorithms for tracking one MC are extended to algorithms for tracking multiple MCs or the minor subspace (MS). In order to overcome the dynamical divergence properties of some available random-gradient-based algorithms, we propose a modification of the Oja-type algorithms, called OJAm, which works satisfactorily. The averaging differential equation and the energy function associated with OJAm are given. It is shown that the averaging differential equation globally asymptotically converges to an invariance set. The corresponding energy, or Lyapunov, function exhibits a unique global minimum, attained if and only if its state matrices span the MS of the autocorrelation matrix of a vector data stream. The other stationary points are saddle (unstable) points. The global convergence of OJAm is also studied. OJAm provides efficient online learning for tracking the MS. It can track an orthonormal basis of the MS, while the other five available algorithms cannot track any orthonormal basis of the MS. The performance of the related algorithms is shown via computer simulations.
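For readers unfamiliar with minor-component tracking, the sketch below shows a generic Oja-type anti-Hebbian recursion for a single unit vector, with explicit re-normalization added for stability. It only illustrates the class of random-gradient MC updates; it is not the OJAm algorithm, which modifies such recursions precisely because unnormalized forms can diverge.

```python
import numpy as np

def track_minor_component(stream, dim, eta=0.01, seed=0):
    """Track the minor component of an input stream with a plain Oja-type
    anti-Hebbian update plus explicit re-normalization (illustration only)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for x in stream:
        y = w @ x
        w -= eta * (y * x - y * y * w)     # anti-Hebbian step away from dominant directions
        w /= np.linalg.norm(w)             # keep the weight vector on the unit sphere
    return w
```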

13.
International Journal of Information Security - This paper successfully tackles the problem of processing a vast amount of security related data for the task of network intrusion detection. It...

14.
Over the past few decades, biometric recognition has firmly established itself as one of the areas with tremendous potential for scientific discovery and for advancing state-of-the-art research in the security domain. Hardly a single area of IT has been left untouched by increased vulnerabilities, online scams, e-fraud, illegal activities, and even pranks in virtual worlds. In parallel with biometric development, which has moved from a focus on single biometric recognition methods to multi-modal information fusion, another rising area of research is virtual-world security and avatar recognition. This article explores links between multi-biometric system architecture and face recognition in virtual worlds, and proposes a methodology that can benefit both applications.

15.
A class of linear classification rules, specifically designed for high-dimensional problems, is proposed. The new rules are based on Gaussian factor models and are able to successfully incorporate the information contained in the sample correlations. Asymptotic results, which allow the number of variables to grow faster than the number of observations, demonstrate that the worst possible expected error rate of the proposed rules converges to the error of the optimal Bayes rule when the postulated model is true, and to a slightly larger constant when this model is a reasonable approximation to the data generating process. Numerical comparisons suggest that, when combined with appropriate variable selection strategies, rules derived from one-factor models perform comparably to, or better than, the most successful extant alternatives under the conditions they were designed for. The proposed methods are implemented as an R package named HiDimDA, available from the CRAN repository.

16.
Transductive transfer learning is one special type of transfer learning problem, in which abundant labeled examples are available in the source domain and only unlabeled examples are available in the target domain. It easily finds applications in spam filtering, microblogging mining, and so on. In this paper, we propose a general framework to solve the problem by mapping the input features in both the source domain and the target domain into a shared latent space and simultaneously minimizing the feature reconstruction loss and the prediction loss. We develop one specific instance of the framework, namely the latent large-margin transductive transfer learning algorithm, and analyze its theoretical bound on the classification loss via Rademacher complexity. We also provide a unified view of several popular transfer learning algorithms under our framework. Experimental results on one synthetic dataset and three application datasets demonstrate the advantages of the proposed algorithm over other state-of-the-art ones.

17.
Derived from traditional manifold learning algorithms, local discriminant analysis methods identify the underlying submanifold structures while employing discriminative information for dimensionality reduction. Mathematically, they can all be unified into a graph embedding framework with different construction criteria. However, such learning algorithms are limited by the curse of dimensionality if the original data lie on a high-dimensional manifold. Different from existing algorithms, we consider the discriminant embedding as a kernel analysis approach in the sample space, and a kernel-view based discriminant method is proposed for embedded feature extraction, where both PCA pre-processing and the pruning of data can be avoided. Extensive experiments on high-dimensional data sets show the robustness and outstanding performance of our proposed method.

18.
In big data applications, data privacy is one of the most pressing concerns, because processing large-scale privacy-sensitive data sets often requires computation resources provisioned by public cloud services. Sub-tree data anonymization is a widely adopted scheme to anonymize data sets for privacy preservation. Top-Down Specialization (TDS) and Bottom-Up Generalization (BUG) are two ways to fulfill sub-tree anonymization. However, existing approaches for sub-tree anonymization fall short of parallelization capability, thereby lacking scalability in handling big data in the cloud. Moreover, either TDS or BUG alone suffers from poor performance for certain values of the k-anonymity parameter. In this paper, we propose a hybrid approach that combines TDS and BUG for efficient sub-tree anonymization over big data. Further, we design MapReduce algorithms for the two components (TDS and BUG) to gain high scalability. Experimental evaluation demonstrates that the hybrid approach significantly improves the scalability and efficiency of the sub-tree anonymization scheme over existing approaches.
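To ground the terminology, the toy sketch below shows the Bottom-Up Generalization half of the scheme on a single quasi-identifier with a hypothetical taxonomy: rare values are repeatedly replaced by their taxonomy parent until every value occurs at least k times. The real approach works on sub-trees of a full taxonomy, handles multiple attributes, and is parallelized with MapReduce; none of that is reflected here.

```python
from collections import Counter

# Toy generalization taxonomy for one quasi-identifier (hypothetical values).
TAXONOMY = {"engineer": "technical", "lawyer": "professional",
            "technical": "ANY", "professional": "ANY", "ANY": "ANY"}

def bottom_up_generalize(records, attr, k):
    """Bottom-Up Generalization for a single attribute: repeatedly replace values
    that occur fewer than k times by their parent in the taxonomy (toy sketch)."""
    values = [r[attr] for r in records]
    while True:
        counts = Counter(values)
        rare = {v for v, c in counts.items() if c < k and v != "ANY"}
        if not rare:
            break
        values = [TAXONOMY[v] if v in rare else v for v in values]
    for r, v in zip(records, values):
        r[attr] = v
    return records

recs = [{"job": j} for j in ["engineer", "engineer", "engineer", "lawyer"]]
print(bottom_up_generalize(recs, "job", k=2))   # the lone 'lawyer' generalizes up to 'ANY'
```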

19.
Multimedia Tools and Applications - Training supervised machine learning models like deep learning requires high-quality labelled datasets that contain enough samples from various categories and...

20.