Found 20 similar articles; search took 0 ms.
1.
Many learning problems require handling high-dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. There is thus a clear need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium- and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
2.
This paper proposes a locality correlation preserving support vector machine (LCPSVM) that combines margin maximization between classes with local correlation preservation of class data. It is a Support Vector Machine (SVM)-like algorithm that explicitly considers the locality correlation within each class in the margin and the penalty term of the optimization function. Canonical correlation analysis (CCA) is used to reveal the hidden correlations between two datasets, and a variant of the correlation analysis model that implements locality preservation has been proposed by integrating local information into the objective function of CCA. Inspired by the idea used in canonical correlation analysis, we propose a locality correlation preserving within-class scatter matrix to replace the within-class scatter matrix in the minimum class variance support vector machine (MCVSVM). This substitution keeps the locality correlation of the data and inherits the properties of SVM and of similar modified support vector machines. LCPSVM is discussed under linearly separable, small sample size, and nonlinearly separable conditions, and experimental results on benchmark datasets demonstrate its effectiveness.
3.
CAIM discretization algorithm (total citations: 8, self-citations: 0, citations by others: 8)
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). In the case of continuous attributes, there is a need for a discretization algorithm that transforms continuous attributes into discrete ones. We describe such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as opposed to some other discretization algorithms. The tests performed using CAIM and six other state-of-the-art discretization algorithms show that discrete attributes generated by the CAIM algorithm almost always have the lowest number of intervals and the highest class-attribute interdependency. Two machine learning algorithms, the CLIP4 rule algorithm and the decision tree algorithm, are used to generate classification rules from data discretized by CAIM. For both the CLIP4 and decision tree algorithms, the accuracy of the generated rules is higher and the number of the rules is lower for data discretized using the CAIM algorithm when compared to data discretized using six other discretization algorithms. The highest classification accuracy was achieved for data sets discretized with the CAIM algorithm, as compared with the other six algorithms.
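The abstract above names the quantity CAIM maximizes but does not state it. As a minimal sketch, assuming the criterion as defined in the CAIM paper (the sum over intervals of the squared count of the dominant class divided by the interval size, averaged over the number of intervals), the value for a given set of cut points could be computed as follows. The function name `caim_value` and the half-open interval convention are illustrative assumptions, and the greedy search that CAIM performs over candidate cut points is not shown.

```python
from collections import Counter


def caim_value(values, labels, cut_points):
    """Assumed CAIM criterion for one attribute and a set of interior cut points.

    Intervals are (-inf, c1], (c1, c2], ..., (ck, +inf); this convention is an
    assumption made for the sketch.
    """
    cuts = sorted(cut_points)
    n_intervals = len(cuts) + 1
    # counts[r] maps a class label to the number of samples in interval r
    counts = [Counter() for _ in range(n_intervals)]
    for v, y in zip(values, labels):
        r = sum(v > c for c in cuts)  # index of the interval containing v
        counts[r][y] += 1
    total = 0.0
    for interval in counts:
        size = sum(interval.values())
        if size:  # an empty interval contributes nothing
            total += max(interval.values()) ** 2 / size
    return total / n_intervals


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = ["a", "a", "a", "b", "b", "b"]
    # A cut that separates the classes scores higher than one that does not.
    print(caim_value(xs, ys, [3.5]), caim_value(xs, ys, [1.5]))
```

Dividing by the number of intervals is what penalizes over-splitting, which is consistent with the abstract's stated goal of a (possibly) minimal number of intervals.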
4.
Data discretization unification (total citations: 1, self-citations: 1, citations by others: 1)
5.
The problem of recursive estimation of an additive noise-corrupted discrete stochastic process is considered for the case where there is a nonzero probability that the observation does not contain the process. Specifically, it is assumed that, independently, with unknown, constant probabilities, observations consist either of pure noise, or derive from a discrete linear process, and that the true source of any individual observation is never known. The optimal Bayesian solution to this unsupervised learning problem is unfortunately infeasible in practice, due to an ever-increasing computer time and memory requirement, and computationally feasible approximations are necessary. In this correspondence a quasi-Bayes (QB) form of approximation is proposed and comparisons are made with the well-known decision-directed (DD) and probabilistic-teacher (PT) schemes.
6.
Crisp discretization is one of the most widely used methods for handling continuous attributes. In crisp discretization, each attribute is split into several intervals and handled as discrete numbers. Although crisp discretization is a convenient tool, it is not appropriate in some situations (e.g., when there is no clear boundary and we cannot set a clear threshold). To address such a problem, several discretizations with fuzzy sets have been proposed. In this paper we examine the effect of fuzzy discretization derived from crisp discretization. The fuzziness of fuzzy discretization is controlled by a fuzzification grade F. We examine two procedures for the setting of F. In one procedure, we set F beforehand and do not change it through training rule-based classifiers. In the other procedure, first we set F and then change it after training. Through computational experiments, we show that the accuracy of rule-based classifiers is improved by an appropriate setting of the grade of fuzzification. Moreover, we show that increasing the grade of fuzzification after training classifiers can often improve generalization ability.
This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.
7.
8.
Hierarchical unsupervised fuzzy clustering (total citations: 5, self-citations: 0, citations by others: 5)
A recursive algorithm for hierarchical fuzzy partitioning is presented. The algorithm has the advantages of hierarchical clustering, while maintaining fuzzy clustering rules. Each pattern can have a nonzero membership in more than one subset of the data in the hierarchy. Optimal feature extraction and reduction is optionally reapplied for each subset. Combining hierarchical and fuzzy concepts is suggested as a natural feasible solution to the cluster validity problem of real data. The convergence and membership conservation of the algorithm are proven. The algorithm is shown to be effective for a variety of data sets with a wide dynamic range of both covariance matrices and number of members in each class.
9.
Journal of Web Semantics, 2008, 6(3): 218-236
The Semantic Web’s need for machine understandable content has led researchers to attempt to automatically acquire such content from a number of sources, including the web. To date, such research has focused on “document-driven” systems that individually process a small set of documents, annotating each with respect to a given ontology. This article introduces OntoSyphon, an alternative that strives to more fully leverage existing ontological content while scaling to extract comparatively shallow content from millions of documents. OntoSyphon operates in an “ontology-driven” manner: taking any ontology as input, OntoSyphon uses the ontology to specify web searches that identify possible semantic instances, relations, and taxonomic information. Redundancy in the web, together with information from the ontology, is then used to automatically verify these candidate instances and relations, enabling OntoSyphon to operate in a fully automated, unsupervised manner. A prototype of OntoSyphon is fully implemented and we present experimental results that demonstrate substantial instance population in three domains based on independently constructed ontologies. We show that using the whole web as a corpus for verification yields the best results, but that using a much smaller web corpus can also yield strong performance. In addition, we consider the problem of selecting the best class for each candidate instance that is discovered, and the problem of ranking the final results. For both problems we introduce new solutions and demonstrate that, for both the small and large corpora, they consistently improve upon previously known techniques.
10.
We show how the quantum paradigm can be used to speed up unsupervised learning algorithms. More precisely, we explain how it is possible to accelerate learning algorithms by quantizing some of their subroutines. Quantization refers to the process that partially or totally converts a classical algorithm to its quantum counterpart in order to improve performance. In particular, we give quantized versions of clustering via minimum spanning tree, divisive clustering and k-medians that are faster than their classical analogues. We also describe a distributed version of k-medians that allows the participants to save on the global communication cost of the protocol compared to the classical version. Finally, we design quantum algorithms for the construction of a neighbourhood graph, outlier detection as well as smart initialization of the cluster centres.
11.
Ling Chen, Chuandong Li, Tingwen Huang, Yiran Chen, Xin Wang. Neural Computing & Applications, 2014, 25(2): 393-400
This letter presents a new memristor crossbar array system and demonstrates its applications in image learning. Controlled-pulse and image-overlay techniques are introduced for programming memristor crossbars and promise better performance in noise reduction. The time-slot technique helps improve image processing speed. Simulink and numerical simulations have been employed to demonstrate the useful applications of the proposed circuit structure in image learning.
12.
Sensor devices and embedded processors are becoming widespread, especially in measurement/monitoring applications. Their limited resources (CPU, memory and/or communication bandwidth, and power) pose some interesting challenges. We need concise, expressive models that represent the important features of the data and lend themselves to efficient estimation. In particular, under these severe constraints, we want models and estimation methods that (a) require little memory and a single pass over the data, (b) can adapt and handle arbitrary periodic components, and (c) can deal with various types of noise. We propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors in remote or hostile environments to efficiently and effectively discover interesting patterns and trends. This can be done automatically, i.e., with no prior inspection of the data or any user intervention and expert tuning before or during data gathering. Our algorithms require limited resources and can thus be incorporated into sensors, possibly alongside a distributed query processing engine [10,6,27]. Updates are performed in constant time with respect to stream size using logarithmic space. Existing forecasting methods (SARIMA, GARCH, etc.) and traditional Fourier and wavelet analysis fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first framework that combines all of the above characteristics. Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform. In fact, AWSOM outperforms manually set up autoregressive models, both in terms of long-term pattern detection and modeling and by at least 10x in resource consumption.
Received: 2 January 2004, Accepted: 23 March 2004, Published online: 12 August 2004
Edited by: S. Abitebou
Anthony Brockwell: This material is based upon work supported by the National Science Foundation under Grants Nos. DMS-9819950 and IIS-0083148.
Christos Faloutsos: This material is based upon work supported by the National Science Foundation under Grants Nos. IIS-9817496, IIS-9988876, IIS-0083148, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, EF-0331657, and IIS-0326322, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by the Defense Advanced Research Projects Agency under Contract No. N66001-00-1-8936. Additional funding was provided by donations from Intel and by a gift from Northrop-Grumman Corporation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.
13.
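The initial cut-point set defined in the Nguyen S.H. discretization algorithm may contain cut points that contribute nothing to the discernibility relation of the decision system, which hurts the algorithm's efficiency. By defining boundary points, the initial cut points are distinguished according to whether they contribute to the discernibility relation of the decision system, and only the set of boundary points is taken as the initial cut-point set, which substantially reduces the number of initial cut points. On this basis, an improved heuristic discretization algorithm is proposed. The algorithm considerably reduces space and time complexity, and comparative experiments demonstrate the correctness and effectiveness of the improved algorithm.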
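The abstract does not give the exact definition of a boundary point, so the following is only a sketch under an assumed, commonly used definition: a candidate cut between two adjacent distinct attribute values is kept unless both values carry exactly one and the same decision class, since such a cut cannot help discern any pair of objects with different decisions. The function name `boundary_cut_points` and the midpoint placement of cuts are illustrative assumptions, not the paper's notation.

```python
from collections import defaultdict


def boundary_cut_points(values, labels):
    """Return candidate cuts that can separate objects with different decisions.

    Assumed rule: drop the midpoint between adjacent values lo < hi only when
    both values are pure and share the same single decision class.
    """
    classes_at = defaultdict(set)  # attribute value -> set of decision classes
    for v, y in zip(values, labels):
        classes_at[v].add(y)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        left, right = classes_at[lo], classes_at[hi]
        if left != right or len(left) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = ["a", "a", "a", "b", "b", "b"]
    # Five candidate midpoints exist, but only one separates the two classes.
    print(boundary_cut_points(xs, ys))
```

In this toy input the initial cut-point set shrinks from five candidates to one, which illustrates the kind of reduction the abstract reports.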
14.
15.
D. B. Rozhdestvenskii. Automation and Remote Control, 2006, 67(12): 1991-2001
An algorithm was obtained for restoration of a continuous process from a finite number of its equispaced discrete samples. The necessary and sufficient sampling conditions were formulated.
16.
Semi-supervised classification methods aim to exploit labeled and unlabeled examples to train a predictive model. Most of these approaches make assumptions on the distribution of classes. This article first proposes a new semi-supervised discretization method, which adopts a very weakly informative prior on the data. This method discretizes the numerical domain of a continuous input variable, while keeping the information relative to the prediction of classes. Then, an in-depth comparison of this semi-supervised method with the original supervised MODL approach is presented. We demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, improved with a post-optimization of the location of the interval bounds.
17.
Microsystem Technologies - This research focuses on bot detection through implementation of techniques such as traffic analysis, unsupervised machine learning, and similarity analysis between...
18.
Neural-network front ends in unsupervised learning (total citations: 1, self-citations: 0, citations by others: 1)
We propose the idea of partial supervision realized in the form of a neural-network front end to schemes of unsupervised learning (clustering). This neural network leads to an anisotropic nature of the induced feature space. The anisotropic property of the space provides the local deformation necessary to properly represent labeled data and enhances the efficiency of the clustering mechanisms to be exploited afterwards. The training of the network is completed based upon available labeled patterns; a referential form of the labeling gives rise to reinforcement learning. It is shown that the discussed approach is universal and can be utilized in conjunction with any clustering method. Experimental studies concentrate on three main categories of unsupervised learning: FUZZY ISODATA, Kohonen self-organizing maps, and hierarchical clustering.
19.
Meinicke P, Klanke S, Memisevic R, Ritter H. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(9): 1379-1391
We propose a nonparametric approach to learning of principal surfaces based on an unsupervised formulation of the Nadaraya-Watson kernel regression estimator. As compared with previous approaches to principal curves and surfaces, the new method offers several advantages: First, it provides a practical solution to the model selection problem because all parameters can be estimated by leave-one-out cross-validation without additional computational cost. In addition, our approach allows for a convenient incorporation of nonlinear spectral methods for parameter initialization, beyond classical initializations based on linear PCA. Furthermore, it shows a simple way to fit principal surfaces in general feature spaces, beyond the usual data space setup. The experimental results illustrate these convenient features on simulated and real data.
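For orientation, the standard supervised Nadaraya-Watson estimator that the abstract builds on predicts a kernel-weighted average of the observed targets. The sketch below shows only that classical one-dimensional form with a Gaussian kernel, together with a leave-one-out error; it is not the paper's unsupervised formulation, and the names `nadaraya_watson`, `loo_squared_error`, and `bandwidth` are illustrative assumptions.

```python
import math


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of ys, with Gaussian weights centred on x_query."""
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in xs]
    # Note: for queries very far from all xs the weights can underflow to zero.
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def loo_squared_error(xs, ys, bandwidth=1.0):
    """Mean squared leave-one-out error, usable for choosing the bandwidth."""
    errors = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        errors.append((nadaraya_watson(x, rest_x, rest_y, bandwidth) - y) ** 2)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(nadaraya_watson(2.0, xs, ys, bandwidth=0.5))
    print(loo_squared_error(xs, ys, bandwidth=0.5))
```

This naive leave-one-out loop recomputes the estimate for each held-out point; the abstract's claim is that in the paper's setting the same quantity comes at no additional computational cost.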
20.