Similar Documents
20 similar documents found (search time: 0 ms)
1.
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
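A minimal sketch of the kind of pipeline this abstract describes, assuming equal-frequency binning as the unsupervised discretizer and a simple dispersion (variance) score for feature selection; both choices are illustrative stand-ins, not the paper's exact methods:

```python
import numpy as np

def equal_frequency_discretize(x, n_bins=8):
    """Unsupervised discretization: bin edges at empirical quantiles."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x)  # integer bin index per sample

def select_features(X, k):
    """Rank features by a dispersion score and keep the top k."""
    scores = X.var(axis=0)            # illustrative relevance criterion
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 instances, 50 features
Xd = np.column_stack([equal_frequency_discretize(X[:, j])
                      for j in range(X.shape[1])])
kept = select_features(Xd, k=10)
X_reduced = Xd[:, kept]
print(X_reduced.shape)                # (100, 10)
```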

2.
解亚萍 (Xie Yaping), 《计算机应用》 (Journal of Computer Applications), 2011, 31(5): 1409-1412
Many data mining methods can only handle discrete-valued attributes, so continuous attributes must be discretized. This paper proposes a discretization method based on statistical correlation coefficients: grounded in statistical correlation theory, it effectively captures the class-attribute interdependence and selects the best cut points. In addition, the variable precision rough set (VPRS) model is incorporated into the discretization to effectively control the information loss in the data. The proposed method was applied to breast cancer diagnosis and to data from other domains; experimental results show that it significantly improves the classification accuracy of See5 decision trees.
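A rough sketch of the core idea as this abstract reads: choose the cut point that maximizes the correlation between the binarized attribute and the class label. The point-biserial-style criterion below is an illustrative reading, not necessarily the paper's exact statistic:

```python
import numpy as np

def best_cut_point(x, y):
    """Pick the cut c that maximizes |corr(1[x > c], y)| over candidates."""
    candidates = np.unique(x)[:-1]
    best_c, best_r = None, -1.0
    for c in candidates:
        b = (x > c).astype(float)
        if b.std() == 0:
            continue
        r = abs(np.corrcoef(b, y)[0, 1])
        if r > best_r:
            best_c, best_r = c, r
    return best_c, best_r

x = np.array([0.5, 1.2, 1.9, 2.7, 3.1, 3.8, 4.4])
y = np.array([0,   0,   0,   1,   1,   1,   1  ])
print(best_cut_point(x, y))   # perfect split at c = 1.9
```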

3.
This paper proposes a locality correlation preserving based support vector machine (LCPSVM) by combining the idea of margin maximization between classes with local correlation preservation of class data. It is a Support Vector Machine (SVM)-like algorithm that explicitly considers the locality correlation within each class in the margin and in the penalty term of the optimization function. Canonical correlation analysis (CCA) is used to reveal the hidden correlations between two datasets, and a variant of the correlation analysis model that implements locality preservation has been proposed by integrating local information into the objective function of CCA. Inspired by this idea, we propose a locality correlation preserving within-class scatter matrix to replace the within-class scatter matrix in the minimum class variance support vector machine (MCVSVM). This substitution keeps the locality correlation of the data and inherits the properties of SVM and similar modified support vector machines. LCPSVM is discussed under linearly separable, small-sample-size, and nonlinearly separable conditions, and experimental results on benchmark datasets demonstrate its effectiveness.

4.
A new supervised locality-preserving canonical correlation analysis algorithm
From a pattern recognition perspective, and building on locality-preserving canonical correlation analysis, this paper proposes a supervised locality-preserving canonical correlation analysis algorithm (SALPCCA). When constructing the nearest-neighbor graph of the samples, the method takes class labels into account: weights are determined by the distances between samples, multiple weighted correlations between samples are established, and the weighted correlations between within-class sample pairs and their neighbors are maximized. The method thus exploits class information while also preserving the local structure of the data. Furthermore, to better extract nonlinear information from the samples, the feature set is mapped into a kernel feature space, yielding a kernelized SALPCCA (KSALPCCA) algorithm. Experimental results on the ORL, Yale, and AR face databases show that the method achieves better recognition performance than other traditional canonical correlation analysis methods.

5.
Most existing multi-label learning techniques consider only the correlation-learning problem and ignore the structural inconsistency introduced by transforming the data: the structural properties of the original feature data change under the mapping, which degrades the classification performance of the model. To address this problem, a multi-label classification algorithm based on structure preservation and correlation learning is proposed. First, a linear mapping is constructed from the feature space to the label space. Then, borrowing the idea of graph regularization, a structure-preservation strategy based on the feature data is introduced to reduce the structural discrepancy caused by the linear transformation. Finally, a pairwise-label correlation-learning strategy is introduced to further optimize the algorithm's parameters and improve classification performance. Tests on standard datasets of different sizes show that the proposed algorithm achieves better classification performance than several popular multi-label classification algorithms, verifying its effectiveness.
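The structure-preservation step can be read as graph-regularized least squares: learn a linear map W from features X to labels Y with a graph Laplacian penalty that keeps neighboring samples' projections close. A minimal closed-form sketch under that assumption (the Laplacian construction is simplified, and the pairwise label-correlation term is omitted):

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Unnormalized graph Laplacian of a symmetrized kNN graph."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:   # skip self at index 0
            W[i, j] = W[j, i] = 1.0
    return np.diag(W.sum(1)) - W

def fit_mapping(X, Y, alpha=1.0, lam=1e-3):
    """Solve min ||XW - Y||^2 + alpha * tr(W^T X^T L X W) + lam * ||W||^2."""
    L = knn_laplacian(X)
    A = X.T @ X + alpha * X.T @ L @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
Y = (rng.random(size=(60, 4)) > 0.7).astype(float)  # multi-label targets
W = fit_mapping(X, Y)
scores = X @ W              # per-label scores; threshold for predictions
print((scores > 0.5).sum())
```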

6.
A multiresolution state-space discretization method with pseudorandom gridding is developed for the episodic unsupervised learning method of Q-learning. It is used as the learning agent for closed-loop control of morphing or highly reconfigurable systems. This paper develops a method whereby a state space is adaptively discretized by progressively finer pseudorandom grids around the regions of interest within the state or learning space, in an effort to break the curse of dimensionality. Utility of the method is demonstrated with application to the problem of a morphing airfoil, which is simulated by a computationally intensive computational fluid dynamics model. By setting the multiresolution method to define the region of interest by the goal the agent seeks, it is shown that this method with the pseudorandom grid can learn a specific goal within ±0.001 while reducing the total number of state-action pairs needed to achieve this level of specificity to fewer than 3000.
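A toy sketch of the multiresolution idea on a 1-D state space: discretize with a coarse pseudorandom grid, then resample a finer pseudorandom grid only inside the region around the goal. The refinement schedule and the Q-learning loop itself are omitted; all names here are illustrative:

```python
import numpy as np

def pseudorandom_grid(lo, hi, n, seed):
    """n pseudorandom grid points covering [lo, hi]."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.uniform(lo, hi, n))

def refine_around_goal(grid, goal, radius, factor=4, seed=1):
    """Replace the neighborhood of the goal with a denser pseudorandom grid."""
    near = np.abs(grid - goal) <= radius
    fine = pseudorandom_grid(goal - radius, goal + radius,
                             factor * near.sum(), seed)
    return np.sort(np.concatenate([grid[~near], fine]))

grid = pseudorandom_grid(0.0, 1.0, 20, seed=0)          # coarse level
for level in range(3):                                   # progressively finer
    grid = refine_around_goal(grid, goal=0.30,
                              radius=0.1 / (level + 1), seed=level)
print(len(grid), grid.min(), grid.max())
```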

7.
CAIM discretization algorithm
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). In the case of continuous attributes, there is a need for a discretization algorithm that transforms continuous attributes into discrete ones. We describe such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as opposed to some other discretization algorithms. The tests performed using CAIM and six other state-of-the-art discretization algorithms show that discrete attributes generated by the CAIM algorithm almost always have the lowest number of intervals and the highest class-attribute interdependency. Two machine learning algorithms, the CLIP4 rule algorithm and the decision tree algorithm, are used to generate classification rules from data discretized by CAIM. For both the CLIP4 and decision tree algorithms, the accuracy of the generated rules is higher and the number of the rules is lower for data discretized using the CAIM algorithm when compared to data discretized using six other discretization algorithms. The highest classification accuracy was achieved for data sets discretized with the CAIM algorithm, as compared with the other six algorithms.
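A compact sketch of the published CAIM criterion and its greedy search: for intervals r with class counts q, CAIM averages max_r^2 / M_r over intervals (max_r is the largest class count in interval r, M_r the interval's size), and boundaries are added one at a time while the criterion improves or the interval count is below the number of classes. Stopping details are simplified here:

```python
import numpy as np

def caim_value(x, y, cuts):
    """CAIM criterion: (1/n) * sum over intervals of max_r^2 / M_r."""
    edges = [-np.inf] + sorted(cuts) + [np.inf]
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x > lo) & (x <= hi)
        if mask.sum() == 0:
            return -np.inf
        counts = np.bincount(y[mask])
        total += counts.max() ** 2 / mask.sum()
    return total / (len(edges) - 1)

def caim_discretize(x, y):
    """Greedy CAIM: add the boundary that most increases the criterion."""
    vals = np.unique(x)
    candidates = (vals[:-1] + vals[1:]) / 2          # midpoints
    cuts, best = [], caim_value(x, y, [])
    n_classes = len(np.unique(y))
    while True:
        gains = [(caim_value(x, y, cuts + [c]), c)
                 for c in candidates if c not in cuts]
        top, c = max(gains)
        if top > best or len(cuts) + 1 < n_classes:
            cuts.append(c); best = top
        else:
            return sorted(cuts)

x = np.array([1.0, 1.5, 2.0, 5.0, 5.5, 6.0, 9.0, 9.5])
y = np.array([0,   0,   0,   1,   1,   1,   2,   2  ])
print(caim_discretize(x, y))   # cuts at 3.5 and 7.5
```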

8.
Data discretization unification

9.
Crisp discretization is one of the most widely used methods for handling continuous attributes. In crisp discretization, each attribute is split into several intervals and handled as discrete numbers. Although crisp discretization is a convenient tool, it is not appropriate in some situations (e.g., when there is no clear boundary and we cannot set a clear threshold). To address such a problem, several discretizations with fuzzy sets have been proposed. In this paper we examine the effect of fuzzy discretization derived from crisp discretization. The fuzziness of fuzzy discretization is controlled by a fuzzification grade F. We examine two procedures for the setting of F. In one procedure, we set F beforehand and do not change it while training rule-based classifiers. In the other procedure, we first set F and then change it after training. Through computational experiments, we show that the accuracy of rule-based classifiers is improved by an appropriate setting of the grade of fuzzification. Moreover, we show that increasing the grade of fuzzification after training classifiers can often improve generalization ability. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.
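A small sketch of fuzzifying a crisp partition: each crisp interval becomes a trapezoidal membership function whose shoulders overlap its neighbors by an amount controlled by a fuzzification grade F (F = 0 recovers crisp intervals; larger F widens the overlap). The exact membership shapes used in the paper may differ:

```python
import numpy as np

def fuzzy_memberships(x, edges, F=0.5):
    """Membership of x in each interval; overlap width = F * half-width."""
    edges = np.asarray(edges, dtype=float)
    mems = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = F * (hi - lo) / 2.0
        if w == 0:                         # crisp case
            m = ((x >= lo) & (x < hi)).astype(float)
        else:                              # trapezoid with fuzzy shoulders
            m = np.clip(np.minimum((x - lo + w) / (2 * w),
                                   (hi - x + w) / (2 * w)), 0.0, 1.0)
        mems.append(m)
    return np.stack(mems, axis=-1)

x = np.linspace(0, 1, 5)
# at the internal boundary x = 0.5, the two memberships sum to 1
print(fuzzy_memberships(x, edges=[0.0, 0.5, 1.0], F=0.5))
```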

10.
11.
The problem of recursive estimation of an additive noise-corrupted discrete stochastic process is considered for the case where there is a nonzero probability that the observation does not contain the process. Specifically, it is assumed that, independently, with unknown, constant probabilities, observations consist either of pure noise, or derive from a discrete linear process, and that the true source of any individual observation is never known. The optimal Bayesian solution to this unsupervised learning problem is unfortunately infeasible in practice, due to an ever increasing computer time and memory requirement, and computationally feasible approximations are necessary. In this correspondence a quasi-Bayes (QB) form of approximation is proposed and comparisons are made with the well-known decision-directed (DD) and probabilistic-teacher (PT) schemes.

12.
Unsupervised domain adaptation (UDA), which aims to use knowledge from a label-rich source domain to help learn an unlabeled target domain, has recently attracted much attention. UDA methods mainly concentrate on source classification and distribution alignment between domains in the expectation of obtaining correct target predictions. In this paper, by contrast, we attempt to learn the target predictions directly, end to end, and develop a self-corrected unsupervised domain adaptation (SCUDA) method with probabilistic label correction. SCUDA adopts a probabilistic label corrector to learn and correct the target labels directly. Specifically, besides the model parameters, the target pseudo-labels are also updated during learning and corrected by the anchor variable, which preserves the class candidates for each sample. Experiments on real datasets show the competitiveness of SCUDA.
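One way to read the label-corrector idea: maintain a pseudo-label distribution per target sample and blend it each round with the model's current prediction, masked by a per-sample anchor over the class candidates. This is an illustrative EM-style sketch, not SCUDA's actual update rule; all names and the momentum parameter are assumptions:

```python
import numpy as np

def correct_pseudo_labels(model_probs, pseudo, anchor, momentum=0.7):
    """Blend current model predictions into pseudo-labels, masked by anchors.

    model_probs: (n, c) softmax outputs for target samples
    pseudo:      (n, c) current pseudo-label distributions
    anchor:      (n, c) 0/1 mask of class candidates kept for each sample
    """
    blended = momentum * pseudo + (1 - momentum) * model_probs
    blended = blended * anchor                    # drop non-candidate classes
    return blended / blended.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=6)         # stand-in model outputs
pseudo = np.full((6, 4), 0.25)                    # uniform initialization
anchor = np.ones((6, 4)); anchor[0, 3] = 0        # class 3 ruled out for sample 0
pseudo = correct_pseudo_labels(probs, pseudo, anchor)
print(pseudo.argmax(axis=1))                      # corrected hard labels
```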

13.
Hierarchical unsupervised fuzzy clustering
A recursive algorithm for hierarchical fuzzy partitioning is presented. The algorithm has the advantages of hierarchical clustering while maintaining fuzzy clustering rules. Each pattern can have a nonzero membership in more than one subset of the data in the hierarchy. Optimal feature extraction and reduction is optionally reapplied for each subset. Combining hierarchical and fuzzy concepts is suggested as a natural, feasible solution to the cluster validity problem of real data. The convergence and membership conservation of the algorithm are proven. The algorithm is shown to be effective for a variety of data sets with a wide dynamic range of both covariance matrices and number of members in each class.
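A minimal sketch of the recursive scheme, using plain weighted fuzzy c-means as the splitter: each node splits the data into two fuzzy children, and a pattern's membership multiplies down the levels, so the leaf memberships of every pattern still sum to 1 (the membership-conservation property the abstract mentions). The paper's optimal feature extraction per subset is omitted:

```python
import numpy as np

def fcm(X, w, c=2, m=2.0, iters=50, seed=0):
    """Fuzzy c-means with per-sample weights; returns (n, c) memberships."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))    # random fuzzy partition
    for _ in range(iters):
        Wm = w[:, None] * U ** m
        centers = (Wm.T @ X) / Wm.sum(0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2 / (m - 1))              # standard FCM update
        U /= U.sum(1, keepdims=True)
    return U

def hierarchical_fcm(X, weights=None, depth=2):
    """Recursive fuzzy splits; memberships multiply down the hierarchy."""
    weights = np.ones(len(X)) if weights is None else weights
    if depth == 0:
        return [weights]
    U = fcm(X, weights, c=2, seed=depth)
    leaves = []
    for j in range(2):
        leaves += hierarchical_fcm(X, weights * U[:, j], depth - 1)
    return leaves

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
leaves = hierarchical_fcm(X, depth=2)              # 4 fuzzy leaf subsets
print(len(leaves), np.allclose(sum(leaves), 1.0))  # membership is conserved
```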

14.
A discretization algorithm based on frequency-supervised cut points is proposed. The algorithm uses the proposed notion of frequency-supervised cut points to generate an initial set of cut points, and then reduces that set. Experimental results show that the resulting cut points not only match the actual data distribution but are also more reasonable and more concise.

15.
The Semantic Web’s need for machine understandable content has led researchers to attempt to automatically acquire such content from a number of sources, including the web. To date, such research has focused on “document-driven” systems that individually process a small set of documents, annotating each with respect to a given ontology. This article introduces OntoSyphon, an alternative that strives to more fully leverage existing ontological content while scaling to extract comparatively shallow content from millions of documents. OntoSyphon operates in an “ontology-driven” manner: taking any ontology as input, OntoSyphon uses the ontology to specify web searches that identify possible semantic instances, relations, and taxonomic information. Redundancy in the web, together with information from the ontology, is then used to automatically verify these candidate instances and relations, enabling OntoSyphon to operate in a fully automated, unsupervised manner. A prototype of OntoSyphon is fully implemented and we present experimental results that demonstrate substantial instance population in three domains based on independently constructed ontologies. We show that using the whole web as a corpus for verification yields the best results, but that using a much smaller web corpus can also yield strong performance. In addition, we consider the problem of selecting the best class for each candidate instance that is discovered, and the problem of ranking the final results. For both problems we introduce new solutions and demonstrate that, for both the small and large corpora, they consistently improve upon previously known techniques.

16.
Sensor devices and embedded processors are becoming widespread, especially in measurement/monitoring applications. Their limited resources (CPU, memory and/or communication bandwidth, and power) pose some interesting challenges. We need concise, expressive models to represent the important features of the data that also lend themselves to efficient estimation. In particular, under these severe constraints, we want models and estimation methods that (a) require little memory and a single pass over the data, (b) can adapt and handle arbitrary periodic components, and (c) can deal with various types of noise. We propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors in remote or hostile environments to efficiently and effectively discover interesting patterns and trends. This can be done automatically, i.e., with no prior inspection of the data or any user intervention and expert tuning before or during data gathering. Our algorithms require limited resources and can thus be incorporated into sensors - possibly alongside a distributed query processing engine [10,6,27]. Updates are performed in constant time with respect to stream size, using logarithmic space. Existing forecasting methods (SARIMA, GARCH, etc.) and traditional Fourier and wavelet analysis fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first framework that combines all of the above characteristics. Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform. In fact, AWSOM outperforms manually set up autoregressive models, both in terms of long-term pattern detection and modeling and by at least 10x in resource consumption.
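A tiny sketch of the single-pass, logarithmic-space bookkeeping such stream models rely on: keep one pending Haar average per level, so each arriving value is folded in with O(log n) memory and amortized O(1) work. AWSOM fits auto-regressive models on wavelet coefficients of this kind; that modeling step is omitted here, and the unnormalized Haar transform below is an illustrative simplification:

```python
# Streaming Haar decomposition: O(log n) state, single pass over the stream.
class StreamingHaar:
    def __init__(self):
        self.pending = {}    # level -> waiting average
        self.coeffs = []     # (level, detail coefficient) as they complete

    def push(self, value, level=0):
        if level in self.pending:
            left = self.pending.pop(level)
            self.coeffs.append((level, (left - value) / 2.0))   # detail
            self.push((left + value) / 2.0, level + 1)          # average up
        else:
            self.pending[level] = value

h = StreamingHaar()
for v in [4.0, 2.0, 5.0, 7.0]:
    h.push(v)
print(h.coeffs)   # [(0, 1.0), (0, -1.0), (1, -1.5)]
```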

17.
This letter presents a new memristor crossbar array system and demonstrates its applications in image learning. Controlled-pulse and image-overlay techniques are introduced for programming the memristor crossbars and promise better performance for noise reduction. A time-slot technique helps improve image processing speed. Simulink and numerical simulations have been employed to demonstrate the useful applications of the proposed circuit structure in image learning.

18.
We show how the quantum paradigm can be used to speed up unsupervised learning algorithms. More precisely, we explain how it is possible to accelerate learning algorithms by quantizing some of their subroutines. Quantization refers to the process that partially or totally converts a classical algorithm to its quantum counterpart in order to improve performance. In particular, we give quantized versions of clustering via minimum spanning tree, divisive clustering and k-medians that are faster than their classical analogues. We also describe a distributed version of k-medians that allows the participants to save on the global communication cost of the protocol compared to the classical version. Finally, we design quantum algorithms for the construction of a neighbourhood graph, outlier detection as well as smart initialization of the cluster centres.
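For reference, a classical k-medians iteration of the kind the paper quantizes: centres are medoids (actual data points minimizing total within-cluster distance), and the distance computations and median finding are the subroutines the quantum versions accelerate. Nothing quantum is shown here:

```python
import numpy as np

def k_medians(X, k=3, iters=20, seed=0):
    """Classical k-medians: centres are medoids (actual data points)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                # medoid: point minimizing total distance within the cluster
                within = np.linalg.norm(pts[:, None] - pts[None], axis=2).sum(1)
                centres[j] = pts[within.argmin()]
    return centres, labels

X = np.vstack([np.random.default_rng(i).normal(3 * i, 0.5, (40, 2))
               for i in range(3)])
centres, labels = k_medians(X)
print(centres)
```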

19.
A clustering-based unsupervised anomaly detection method
To address the inability of unsupervised anomaly detection methods to detect sudden large-scale attacks, a clustering-based unsupervised anomaly detection model is proposed. The model selects, from multiple clusterers, the partition with the smallest Davies-Bouldin (DB) index, and classifies each cluster using its minimum and maximum intra-cluster distances, thereby identifying attacks. Experiments show that the model noticeably improves the detection rate and reduces the false-alarm rate.
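An illustrative sketch of the selection-then-flagging pipeline, using scikit-learn's KMeans over several cluster counts as the pool of clusterers and a cluster's maximum distance-to-centroid as a stand-in for the paper's min/max intra-cluster-distance rule (the exact rule and clusterer pool are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def detect_anomalies(X, ks=(2, 3, 4, 5), attack_quantile=0.9):
    """Pick the clustering with the lowest DB index, then flag clusters
    whose spread (max distance to centroid) exceeds a threshold."""
    best = min((davies_bouldin_score(X, KMeans(k, n_init=10,
                random_state=0).fit_predict(X)), k) for k in ks)
    labels = KMeans(best[1], n_init=10, random_state=0).fit_predict(X)
    flags = np.zeros(len(X), dtype=bool)
    spreads = {}
    for j in np.unique(labels):
        pts = X[labels == j]
        spreads[j] = np.linalg.norm(pts - pts.mean(0), axis=1).max()
    cut = np.quantile(list(spreads.values()), attack_quantile)
    for j, s in spreads.items():
        if s >= cut:
            flags[labels == j] = True   # whole cluster marked as attack
    return flags

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (80, 2)), rng.normal(4, 2.0, (20, 2))])
print(detect_anomalies(X).sum(), "points flagged")
```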

20.
Random relations are random sets defined on a two-dimensional (or higher-dimensional) space. After defining the correlation of two variables constrained by a random relation as an interval, the effect of imprecision was studied by using a multi-valued mapping whose domain is a space of joint random variables. This perspective led to the notions of consistent and non-consistent marginals, which parallel those of epistemic independence, and of unknown interaction and epistemic independence for random sets, respectively. The calculation of the correlation bounds entails solving two optimisation problems that are NP-hard. When the entire random relation is available, it is shown that the hypothesis of non-consistent marginals leads to correlation bounds that are much larger (four orders of magnitude in some cases) than those obtained under the hypothesis of consistent marginals; this hierarchy parallels the hierarchy between probability bounds for unknown interaction and strong independence, respectively. Solutions of the optimisation problems were found at the extremes of their feasible intervals in 80–100% of the cases when non-consistent marginals were assumed, but this range became 75–84% when consistent marginals were assumed. When only the marginals are available, there is a complete loss of knowledge in the correlation, and the correlation interval is nearly vacuous or vacuous (i.e., [−1, 1]) even if the measurements are sufficiently accurate that their narrowed intervals do not overlap. Solutions to the optimisation problems were found at the extremes of their feasible intervals 50% or less of the time.
