20 similar documents found; search took 15 ms
1.
Isomap is a widely used low-dimensional embedding method in which geodesic distances on a weighted graph are combined with classical scaling (metric multidimensional scaling). In this paper we address two critical issues that Isomap does not consider: (1) the generalization (projection) property and (2) topological stability. We then present a robust kernel Isomap method equipped with both properties. We relate Isomap to Mercer kernel machines, so that the generalization property emerges naturally through kernel principal component analysis. For topological stability, we investigate the network flow in a graph, providing a method for eliminating critical outliers. The useful behavior of the robust kernel Isomap is confirmed through numerical experiments with several data sets.
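As background for the kernel reformulation, the standard Isomap pipeline the abstract builds on can be sketched in a few lines. This is plain Isomap, not the paper's robust kernel variant; the toy data and parameter values are illustrative.

```python
import numpy as np

def isomap(X, n_neighbors=2, n_components=1):
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Keep only edges to the k nearest neighbours; others start at infinity.
    g = np.full((n, n), np.inf)
    for i in range(n):
        nearest = np.argsort(d[i])[:n_neighbors + 1]  # includes i itself
        g[i, nearest] = d[i, nearest]
    g = np.minimum(g, g.T)  # symmetrise the k-NN graph
    for k in range(n):      # Floyd-Warshall shortest paths approximate geodesics
        g = np.minimum(g, g[:, k:k + 1] + g[k:k + 1, :])
    # Classical scaling: double-centre squared distances, take top eigenvectors.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (g ** 2) @ J
    w, v = np.linalg.eigh(B)                     # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:n_components]
    return v[:, top] * np.sqrt(np.maximum(w[top], 0))

# Points sampled along a curve: a 1-D embedding should preserve their order.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 0.0]])
Y = isomap(X)
```

The centred matrix B above is what the kernel view reinterprets as a Mercer kernel, which is where the out-of-sample projection property comes from.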
2.
Recently, the Isomap algorithm has been proposed for learning a parameterized manifold from a set of unorganized samples from the manifold. It is based on extending the classical multidimensional scaling method for dimension reduction, replacing pairwise Euclidean distances by the geodesic distances on the manifold. A continuous version of Isomap called continuum Isomap is proposed. Manifold learning in the continuous framework is then reduced to an eigenvalue problem of an integral operator. It is shown that the continuum Isomap can perfectly recover the underlying parameterization if the mapping associated with the parameterized manifold is an isometry and its domain is convex. The continuum Isomap also provides a natural way to compute low-dimensional embeddings for out-of-sample data points. Some error bounds are given for the case when the isometry condition is violated. Several illustrative numerical examples are also provided.
3.
Minimum class variance support vector machine (MCVSVM) and the large margin linear projection (LMLP) classifier, in contrast with the traditional support vector machine (SVM), take the distribution of the data into consideration and can achieve better performance. However, when the within-class scatter matrix is singular, both MCVSVM and LMLP exploit the discriminant information in only one subspace of the within-class scatter matrix and discard the discriminant information in the other. In this paper, a twin-space support vector machine (TSSVM) algorithm is proposed for high-dimensional classification tasks in which the within-class scatter matrix is singular. TSSVM is rooted in both the non-null space and the null space of the within-class scatter matrix, takes full advantage of the discriminant information in the two subspaces, and can therefore achieve better classification accuracy. We first discuss the linear case of TSSVM and then develop the nonlinear TSSVM. Experimental results on real datasets validate the effectiveness of TSSVM and indicate its superior performance over MCVSVM and LMLP.
4.
XML documents have recently become ubiquitous because of their varied applicability in a number of applications. Classification is an important problem in the data mining domain, but current classification methods for XML documents use IR-based methods in which each document is treated as a bag of words. Such techniques ignore a significant amount of information hidden inside the documents. In this paper we discuss the problem of rule-based classification of XML data by using frequent discriminatory substructures within XML documents. Such a technique is more capable of finding the classification characteristics of documents. In addition, the technique can also be extended to cost-sensitive classification. We show the effectiveness of the method with respect to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semi-structured data.
Editors: Hendrik Blockeel, David Jensen and Stefan Kramer
An erratum to this article is available at .
5.
A nonlinear prediction algorithm for network traffic based on the recursive least squares support vector machine is proposed. The least squares support vector machine first maps the original traffic data into a high-dimensional space, in which the traffic is then predicted; nonlinear prediction in the low-dimensional space is thereby converted into linear prediction in the high-dimensional space, improving prediction performance. Simulation results show that the prediction error stays within 5%.
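The recursive LS-SVM itself is not specified in the abstract; a non-recursive sketch of the same idea, with lagged traffic windows mapped through an RBF kernel and a single regularised linear solve in place of the SVM quadratic program, might look as follows. All parameter values are illustrative.

```python
import numpy as np

def fit_predict(history, window=3, lam=1e-6):
    # Lagged windows are the inputs; the value that follows each is the target.
    X = np.array([history[i:i + window] for i in range(len(history) - window)])
    y = np.array(history[window:])
    # RBF kernel plays the role of the map into a high-dimensional space;
    # prediction is linear in that space.
    K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
    # LS-SVM style: solve one regularised linear system for the dual weights.
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y)
    k = np.exp(-np.sum((X - np.array(history[-window:])) ** 2, axis=-1))
    return float(k @ alpha)
```

On a perfectly periodic toy series the predictor reproduces the next value of the cycle almost exactly; real traffic traces would need tuned kernel width and regularisation.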
6.
A biclustering algorithm, based on a greedy technique and enriched with a local search strategy to escape poor local minima, is proposed. The algorithm starts with an initial random solution and searches for a locally optimal solution by successive transformations that improve a gain function. The gain function combines the mean squared residue, the row variance, and the size of the bicluster. Different strategies to escape local minima are introduced and compared. Experimental results on several microarray data sets show that the method is able to find significant biclusters, also from a biological point of view.
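The mean squared residue term of the gain function is the standard Cheng-Church coherence measure; a minimal sketch, assuming the bicluster is given as row and column index lists, is:

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    # Residue of each cell: value minus its row mean and column mean,
    # plus the overall mean of the submatrix. Zero for additive patterns.
    S = A[np.ix_(rows, cols)]
    residue = (S - S.mean(axis=1, keepdims=True)
                 - S.mean(axis=0, keepdims=True) + S.mean())
    return float((residue ** 2).mean())
```

A perfectly coherent (row-effect plus column-effect) bicluster scores zero; noise raises the score, so the greedy search prefers transformations that lower it.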
7.
8.
Identification of relevant genes from microarray data is a clear need in many applications. Different ranking techniques with different evaluation criteria are used for this purpose, and they usually assign different ranks to the same gene. As a result, different techniques identify different gene subsets, which may not be the set of significant genes. To overcome this problem, this study suggests pipelining the ranking techniques. In each stage of the pipeline a few of the lowest-ranked features are eliminated, so that a relatively good subset of features is preserved at the end. However, the order in which the ranking techniques are applied in the pipeline is important to ensure that the significant genes are preserved in the final subset. For this experimental study, twenty-four unique pipeline models are generated from four gene ranking strategies. These pipelines are tested on seven microarray databases to find a suitable pipeline for the task. The resulting gene subsets are further tested with four classifiers, and four performance metrics are evaluated. No single pipeline dominates the others in performance, so a grading system is applied to the results to identify a consistent model. The grading system's finding that one pipeline model is significant is also confirmed by the Nemenyi post-hoc test. The performance of this pipeline model is compared with the four individual ranking techniques; although it is not always superior, it yields better results the majority of the time and can be suggested as a consistent model. However, it requires more computational time than a single ranking technique.
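The staged elimination can be sketched as below; the two toy ranking criteria (class-mean difference and variance) are illustrative stand-ins for the four strategies used in the study, and the keep fraction is arbitrary.

```python
def mean_diff(col, y):
    # Absolute difference of class means: a simple relevance ranker.
    a = [v for v, label in zip(col, y) if label == 0]
    b = [v for v, label in zip(col, y) if label == 1]
    return abs(sum(b) / len(b) - sum(a) / len(a))

def variance(col, y):
    # Variance ranker: constant genes score zero.
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def pipeline_select(X, y, rankers, keep_fraction=0.5):
    genes = list(range(len(X[0])))
    for rank in rankers:
        # Each stage re-ranks only the surviving genes with its own criterion
        # and drops the lowest-ranked fraction.
        scores = {g: rank([row[g] for row in X], y) for g in genes}
        genes.sort(key=lambda g: scores[g], reverse=True)
        genes = genes[:max(1, int(len(genes) * keep_fraction))]
    return sorted(genes)
```

Because each stage sees only the survivors of the previous one, the stage order determines which genes can reach the final subset, which is the ordering effect the study investigates.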
9.
《Graphical Models》2014,76(2):103-114
We present a visualization system for exploring high-dimensional graphical data, such as textures or 3D models, in 2D space using a dimensionality reduction method. To arrange high-dimensional data in a meaningful 2D space, we develop a novel semi-supervised dimensionality reduction method that can embed high-dimensional data in a user-defined 2D coordinate system that is meaningful in terms of the properties of the data. This is achieved by modifying the Isomap method, weighting the data so that the resulting coordinates have no degeneracies and are orthogonal.
10.
Mohd Saberi Mohamad Sigeru Omatu Safaai Deris Muhammad Faiz Misman Michifumi Yoshioka 《Artificial Life and Robotics》2009,13(2):414-417
Gene expression technology, namely microarrays, offers the ability to measure the expression levels of thousands of genes simultaneously in biological organisms. Microarray data are expected to be of significant help in the development of an efficient cancer diagnosis and classification platform. A major problem with these data is that the number of genes greatly exceeds the number of tissue samples, and the data also contain noisy genes. Literature reviews have shown that selecting a small subset of informative genes can lead to improved classification accuracy. Therefore, this paper aims to select a small subset of informative genes that are most relevant for cancer classification. To achieve this aim, an approach using two hybrid methods is proposed. The approach is assessed and evaluated on two well-known microarray data sets, showing competitive results.
This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008
11.
Automatic text classification is usually based on models constructed through learning from training examples. However, as text document repositories grow rapidly, the storage requirements and computational cost of model learning become ever higher. Instance selection is one solution to this limitation: the aim is to reduce the amount of data by filtering noisy data out of a given training dataset. A number of instance selection algorithms have been proposed in the literature, such as ENN, IB3, ICF, and DROP3. However, all of these methods were developed for the k-nearest neighbor (k-NN) classifier, and their performance has not been examined in the text classification domain, where the dimensionality of the dataset is usually very high. Support vector machines (SVMs) are a core text classification technique. In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed. First, a regression plane in the original feature space is identified by applying a threshold distance between the given training instances and their class centers. Then, another threshold distance, between the identified data (forming the regression plane) and the regression plane, is used to decide on the support vectors for the selected instances. Experimental results on the TechTC-100 dataset show the superior performance of SVOIS over other state-of-the-art algorithms. In particular, selecting text documents with SVOIS allows the k-NN and SVM classifiers to perform better than without instance selection.
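The first filtering step described above, keeping instances close to their class centre, can be sketched as follows. This is a simplified stand-in, not the full SVOIS procedure; the per-class quantile threshold is an illustrative substitute for the paper's threshold distance.

```python
import numpy as np

def select_by_class_center(X, y, quantile=0.8):
    keep = []
    for c in set(y):
        idx = [i for i, label in enumerate(y) if label == c]
        centre = X[idx].mean(axis=0)
        # Distance of every instance of this class to its class centre.
        d = np.linalg.norm(X[idx] - centre, axis=1)
        thr = np.quantile(d, quantile)  # per-class distance threshold
        keep += [i for i, di in zip(idx, d) if di <= thr]
    return sorted(keep)
```

On a toy set with one far outlier in class 0, only that outlier is filtered out, which is exactly the noise-removal effect instance selection is after.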
12.
Gene expression microarray is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. We present a new heuristic for selecting relevant gene subsets for later use in classification. Our method is based on the statistical significance of adding a gene from a ranked list to the final subset. The efficiency and effectiveness of our technique are demonstrated through extensive comparisons with other representative heuristics. Our approach shows excellent performance, not only at identifying relevant genes, but also with respect to computational cost.
13.
《Expert systems with applications》2014,41(11):5520-5525
With the growth of computational culture, the concept of the cultural syntactic unit has been investigated quite intensively in recent years. Existing approaches to extracting music features usually group notes together according to certain rules; however, it can be better to take the integrity of the melody into account. In this paper, we introduce a syntactic unit named the music gene and propose a music feature extraction method for era classification. Music genes are extracted from XML files and evaluated using their intervals to investigate how well they represent musical features. We compare the performance of the music gene with that of the previously proposed motif to illustrate the effectiveness of the music gene in classification. Support vector machines are applied to classify XML files into their respective classes by learning from training data. We obtain the classification accuracy for a collection of 764 music genes, demonstrating that significant classification can be achieved using this higher-level music feature.
14.
Automatic land cover analysis for Tenerife by supervised classification using remotely sensed data  Total citations: 5 (self-citations: 0, others: 5)
Automatic land cover classification from satellite images is an important topic in many remote sensing applications. In this paper, we consider three different statistical approaches to this problem: two of them, the well-known maximum likelihood classification (ML) and the support vector machine (SVM), are noncontextual methods; the third, iterated conditional modes (ICM), exploits spatial context by using a Markov random field. We apply these methods to Landsat 5 Thematic Mapper (TM) data from Tenerife, the largest of the Canary Islands. Due to the size and the strong relief of the island, ground truth data could be collected only sparsely, by examination of test areas for previously defined land cover classes. We show that after application of an unsupervised clustering method to identify subclasses, all classification algorithms give satisfactory results (with an overall statistical accuracy of about 90%) if the model parameters are selected appropriately. Although theoretically superior to ML, both SVM and ICM have to be used carefully: ICM is able to improve ML, but when run for too many iterations, spatially small sample areas are smoothed away, leading to slightly worse statistical classification results. SVM yields better statistical results than ML, but on visual inspection the classification result is not completely satisfying. This is because no a priori information on the frequency of occurrence of a class was used, which helps ML to rule out unlikely classes.
15.
Changshui Zhang, Jun Wang, David Zhang 《Pattern recognition》2004,37(2):325-336
Locally linear embedding (LLE) is a recently proposed nonlinear dimensionality reduction method. It can reveal the intrinsic distribution of data, which classical linear dimensionality reduction methods cannot provide. The application of LLE, however, is limited by its lack of a parametric mapping between the observations and the low-dimensional output, and by its need for a large data set. In this paper, we propose methods to establish the mapping from the low-dimensional embedded space to the high-dimensional space for LLE and validate their efficiency by reconstructing multi-pose face images. Furthermore, we observe that the high-dimensional structure of multi-pose face images is similar across different persons for the same kind of pose change. Given the structure of the data distribution, obtained by learning a large number of multi-pose images in a training set, the support vector regression (SVR) method of statistical learning theory is used to learn the high-dimensional structure for an individual from a small sample set. The detailed learning method and algorithm are given and applied to reconstruct and synthesize face images in small-sample cases. The experiments show that our idea and method are effective.
16.
We present a two-step method to speed up object detection systems in computer vision that use support vector machines as classifiers. In the first step we build a hierarchy of classifiers: on the bottom level, a simple and fast linear classifier analyzes the whole image and rejects large parts of the background; on the top level, a slower but more accurate classifier performs the final detection. We propose a new method for automatically building and training such a hierarchy of classifiers. In the second step we apply feature reduction to the top-level classifier by choosing relevant image features according to a measure derived from statistical learning theory. Experiments with a face detection system show that combining feature reduction with hierarchical classification yields a speed-up by a factor of 335 with similar classification performance.
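The two-level cascade can be sketched generically as below; `fast_score` and `slow_classify` are hypothetical stand-ins for the linear rejection stage and the accurate top-level SVM, and the toy windows are illustrative.

```python
def cascade_detect(windows, fast_score, slow_classify, threshold=0.0):
    # Level 1: a cheap linear score rejects most background windows.
    survivors = [w for w in windows if fast_score(w) >= threshold]
    # Level 2: the slow, accurate classifier sees only the survivors.
    return [w for w in survivors if slow_classify(w)]

slow_calls = []

def fast(w):
    return sum(w)            # stand-in for the fast linear classifier

def slow(w):
    slow_calls.append(w)     # record how often the expensive stage runs
    return max(w) > 2        # stand-in for the accurate classifier

windows = [[0, 0], [1, 3], [0, 1], [2, 2]]
detections = cascade_detect(windows, fast, slow, threshold=2)
```

The speed-up comes entirely from how few windows survive level 1; in the toy run the slow classifier is evaluated on only two of the four windows.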
17.
Donghyeon Yu 《Computational statistics & data analysis》2012,56(3):510-521
A paired data set is common in microarray experiments, where the data are often incompletely observed for some pairs due to various technical reasons. In microarray paired data sets, it is of main interest to detect differentially expressed genes, which are usually identified by testing the equality of means of expressions within a pair. While much attention has been paid to testing mean equality with incomplete paired data in previous literature, the existing methods commonly assume the normality of data or rely on the large sample theory. In this paper, we propose a new test based on permutations, which is free from the normality assumption and large sample theory. We consider permutation statistics with linear mixtures of paired and unpaired samples as test statistics, and propose a procedure to find the optimal mixture that minimizes the conditional variances of the test statistics, given the observations. Simulations are conducted for numerical power comparisons between the proposed permutation tests and other existing methods. We apply the proposed method to find differentially expressed genes for a colorectal cancer study.
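For complete pairs, the basic sign-flipping permutation test that the paper generalizes can be sketched as follows; the paper's actual contribution, handling incomplete pairs via an optimal linear mixture of paired and unpaired statistics, is not shown here.

```python
import random

def paired_permutation_test(x, y, n_perm=2000, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    # Under H0 (equal means), the sign of each within-pair difference
    # is exchangeable, so we randomly flip signs and recompute the statistic.
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / n_perm  # two-sided permutation p-value
```

No normality assumption or large-sample approximation is used: the null distribution is generated directly from the observed differences.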
18.
Accurate recognition of cancers based on microarray gene expression is very important for doctors to choose a proper treatment. Genomic microarrays are powerful research tools in bioinformatics and modern medicinal research. However, even a simple microarray experiment produces very high-dimensional data and a huge amount of information, which challenges researchers to extract the important features and reduce the high dimensionality. In this paper, a nonlinear dimensionality reduction method, kernel-based locally linear embedding, is proposed to select the optimal number of nearest neighbors and construct a uniformly distributed manifold. In addition, the support vector machine, which has given rise to a new class of theoretically elegant learning machines, is used to classify and recognize genomic microarray data. We demonstrate the application of these techniques on two published DNA microarray data sets. The experimental results and comparisons demonstrate that the proposed method is an effective approach.
19.
Support vector machines with genetic fuzzy feature transformation for biomedical data classification
In this paper, we present a genetic fuzzy feature transformation method for support vector machines (SVMs) to achieve more accurate data classification. The given data are first transformed into a high-dimensional feature space by a fuzzy system; SVMs then map the data into a still higher feature space and construct the hyperplane that makes the final decision. Genetic algorithms are used to optimize the fuzzy feature transformation so that the newly generated features help the SVMs classify biomedical data more accurately under uncertainty. The experimental results show that the new genetic fuzzy SVMs have better generalization ability than traditional SVMs in terms of prediction accuracy.
20.
K. Chidananda Gowda 《Pattern recognition》1984,17(6):667-676
A new scheme, incorporating dimensionality reduction and clustering, suitable for classification of a large volume of remotely sensed data using a small amount of memory is proposed. The scheme involves transforming the data from multidimensional n-space to a 3-dimensional primary color space of blue, green and red coordinates. The dimensionality reduction is followed by data reduction, which involves assigning 3-dimensional samples to a 2-dimensional array. Finally, a multi-stage ISODATA technique incorporating a novel seedpoint picking method is used to obtain the desired number of clusters.
The storage requirements are reduced to a low value by making five passes through the data and storing necessary information during each pass. The first three passes are used to find the minimum and maximum values of some of the variables. The data reduction is done and a classification table is formed during the fourth pass. The classification map is obtained during the fifth pass. The computer memory required is about 2K machine words.
The efficacy of the algorithm is justified by simulation studies using multispectral LANDSAT data.