Similar literature
 20 similar documents found; search time: 31 ms
1.
Neighborhood rough set based heterogeneous feature subset selection   Total citations: 6, self-citations: 0, citations by others: 6
Feature subset selection is viewed as an important preprocessing step for pattern recognition, machine learning and data mining. Most research focuses on homogeneous feature selection, namely, on purely numerical or purely categorical features. In this paper, we introduce a neighborhood rough set model to deal with the problem of heterogeneous feature subset selection. As the classical rough set model can only be used to evaluate categorical features, we generalize this model with neighborhood relations and introduce a neighborhood rough set model. The proposed model degrades to the classical one if we set the neighborhood size to zero. The neighborhood model is used to reduce numerical and categorical features by assigning different thresholds to different kinds of attributes. In this model, the sizes of the neighborhood lower and upper approximations of decisions reflect the discriminating capability of feature subsets. The size of the lower approximation is computed as the dependency between decision and condition attributes. We use the neighborhood dependency to evaluate the significance of a subset of heterogeneous features and construct forward feature subset selection algorithms. The proposed algorithms are compared with some classical techniques. Experimental results show that the neighborhood model based method is more flexible in dealing with heterogeneous data.
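
The dependency described in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance over numerical features and counts a sample as belonging to the lower approximation when every sample within distance `delta` shares its decision label. The function name `neighborhood_dependency` and the toy data are hypothetical and are not taken from the paper.

```python
import math

def neighborhood_dependency(samples, labels, delta):
    """Fraction of samples whose delta-neighborhood contains a single decision class.

    Assumption: Euclidean distance on numerical features; the paper may use
    other metrics or per-attribute thresholds.
    """
    consistent = 0
    for i, x in enumerate(samples):
        # Sample i is in the lower approximation of its class if every
        # neighbor (including itself) carries the same label.
        pure = all(
            labels[j] == labels[i]
            for j, z in enumerate(samples)
            if math.dist(x, z) <= delta
        )
        if pure:
            consistent += 1
    return consistent / len(samples)

# Hypothetical one-dimensional toy data.
samples = [[0.0], [0.1], [0.5], [0.9], [1.0]]
labels = [0, 0, 1, 1, 1]

dep_small = neighborhood_dependency(samples, labels, delta=0.15)
dep_large = neighborhood_dependency(samples, labels, delta=0.45)
```

With `delta = 0`, each neighborhood contains only samples with identical feature values, which corresponds to the equivalence classes of the classical model mentioned in the abstract. A larger `delta` merges samples from different classes into one neighborhood and lowers the dependency.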

2.
Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters which are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective, and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, our algorithm yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free, fully automatic and, in particular, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data is processed. Fourth, it is scalable to large data sets. And finally, our algorithm is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and Chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.

3.
Qinghua Hu, Jinfu Liu, Daren Yu. 《Knowledge》, 2008, 21(4): 294-304
Feature subset selection presents a common challenge for applications where data with tens or hundreds of features are available. Existing feature selection algorithms are mainly designed for dealing with numerical or categorical attributes. However, data usually comes in a mixed format in real-world applications. In this paper, we generalize Pawlak’s rough set model into a δ neighborhood rough set model and a k-nearest-neighbor rough set model, where objects with numerical attributes are granulated with δ neighborhood relations or k-nearest-neighbor relations, while objects with categorical features are granulated with equivalence relations. The induced information granules are then used to approximate the decision with lower and upper approximations. We compute the lower approximations of the decision to measure the significance of attributes. Based on the proposed models, we give the definition of the significance of mixed features and construct a greedy attribute reduction algorithm. We compare the proposed algorithm with others in terms of the number of selected features and classification performance. Experiments show that the proposed technique is effective.
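
The granulation and greedy reduction outlined in this abstract can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes a per-attribute threshold `delta` for numerical attributes (the paper may use a joint distance), exact equality for categorical attributes, and a stopping rule that halts when no candidate attribute increases the dependency. All names and the toy data are hypothetical.

```python
def in_granule(a, b, features, numeric, delta):
    """True if record b falls in the granule of record a over the given features."""
    for f in features:
        if f in numeric:
            # delta neighborhood relation on a numerical attribute
            if abs(a[f] - b[f]) > delta:
                return False
        elif a[f] != b[f]:
            # equivalence relation on a categorical attribute
            return False
    return True

def dependency(rows, labels, features, numeric, delta):
    """Share of records whose granule lies entirely inside one decision class."""
    if not features:
        return 0.0
    count = 0
    for i, a in enumerate(rows):
        if all(labels[j] == labels[i]
               for j, b in enumerate(rows)
               if in_granule(a, b, features, numeric, delta)):
            count += 1
    return count / len(rows)

def greedy_reduct(rows, labels, all_features, numeric, delta):
    """Forward selection: repeatedly add the attribute with the largest dependency."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        scored = [(dependency(rows, labels, selected + [f], numeric, delta), f)
                  for f in candidates]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:  # no improvement: stop
            break
        selected.append(feature)
        best = score
    return selected, best

# Hypothetical mixed data: two numerical attributes and one categorical attribute.
rows = [
    {"temp": 0.1, "color": "red", "noise": 0.5},
    {"temp": 0.2, "color": "red", "noise": 0.1},
    {"temp": 0.8, "color": "blue", "noise": 0.5},
    {"temp": 0.9, "color": "blue", "noise": 0.1},
]
labels = ["a", "a", "b", "b"]
selected, score = greedy_reduct(rows, labels, ["noise", "color", "temp"],
                                numeric={"temp", "noise"}, delta=0.15)
```

In this toy data, `color` alone separates the two classes, so the sketch stops after selecting it. Ties are broken by the order of the candidate list, which is an arbitrary choice made here for simplicity.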

4.
This paper is dedicated to erroneous data detection and imputation methods in surveys. We describe experiments conducted under the scope of a European project for studying new statistical methods based on neural networks. We show that the self-organising map can be used successfully for these tasks. A self-organising map is calibrated according to the available observations, described through a set of correlated variables handled together. The map can then be used both to detect erroneous data and to impute values to partial observations. We apply these principles to a real-size transport survey database. We show that the performance of our imputation model compares well to other classical methods, and that the use of a self-organising map for data correction provides an effective system for data validation, data correction and data analysis.

5.
Mixed data sets containing numerical and categorical attributes are nowadays ubiquitous. Converting them to one attribute type may lead to a loss of information. We present an approach for handling numerical and categorical attributes in a holistic view. For data sets with many attributes, dimensionality reduction (DR) methods can help to generate visual representations involving all attributes. While automatic DR for mixed data sets is possible using weighted combinations, the impact of each attribute on the resulting projection is difficult to measure. Interactive support allows the user to understand the impact of data dimensions in the formation of patterns. Star Coordinates is a well-known interactive linear DR technique for multi-dimensional numerical data sets. We propose to extend Star Coordinates and its initial configuration schemes to mixed data sets. In conjunction with analysing numerical attributes, our extension allows for exploring the impact of categorical dimensions and individual categories on the structure of the entire data set. The main challenge when interacting with Star Coordinates is typically to find a good configuration of the attribute axes. We propose a guided mixed data analysis based on maximizing projection quality measures by the use of recommended transformations, named hints, in order to find a proper configuration of the attribute axes.

6.
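For mixed data in which nominal and numerical attributes coexist, we combine multigranulation neighborhood rough sets with intuitionistic fuzzy sets and define the fuzzy covering rough membership degree and non-membership degree. Based on different sequences of attribute sets and different neighborhood radii, we construct a multigranulation neighborhood rough intuitionistic fuzzy set model and prove its relevant properties. We then propose approximation sets for the optimistic and pessimistic multigranulation neighborhood rough intuitionistic fuzzy sets and discuss the properties of the model. Finally, a worked example computed with the proposed model shows that it handles mixed data with nominal and numerical attributes well.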

7.
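Adaptive neuro-fuzzy inference system with mixed data input   Total citations: 1, self-citations: 0, citations by others: 1
Most existing data modeling methods rely on quantitative numerical information. For modeling problems with mixed numerical and categorical inputs, multiple sub-models are usually built according to the combinations of categorical variables; with several categorical inputs, this tends to cause unevenly distributed sub-model data and long training times. To address these problems, an adaptive neuro-fuzzy inference system model with mixed data input is proposed. Building on the adaptive fuzzy inference system, it introduces a firing-strength transfer matrix and a consequent influence matrix, identifies the model structure by subtractive clustering based on the Gower distance, and trains the model parameters with a hybrid learning algorithm, so that the mixed numerical and categorical data act on both the antecedent and consequent parameters of the fuzzy rules and jointly influence the model output. Simulation experiments analyze the effect of categorical data on the rule consequents and the influence of the structure identification algorithm on the number of fuzzy rules. Comparison with several other mixed-data modeling methods shows that the proposed model achieves higher prediction accuracy and computational efficiency.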

8.
This paper, dealing with discrete production systems, has two goals. The first is to identify cases in which the use of expert configurator software significantly improves the achievement of concurrent engineering. Since the industrial implementation of these configurators is often tricky, the second goal is to present a model that allows the industrial problem to be specified before implementation. This paper is divided into three sections. We first briefly recall the main trends of concurrent engineering and the related interest in expert configurators. Then, the basic behavior and solutions for expert configurators are presented and the modeling requirements are pointed out. The third part is dedicated to the presentation, illustrated with an example, of our model and our method.

9.
Many pattern classification algorithms such as Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), and K-Nearest Neighbors (KNNs) require data to consist of purely numerical variables. However, many real-world data sets consist of both categorical and numerical variables. In this paper we suggest an effective method of converting mixed data of categorical and numerical variables into data of purely numerical variables for binary classification. Since the suggested method is based on the theory of learning Bayesian Network Classifiers (BNCs), it is computationally efficient and robust to noise and data loss. The suggested method is also expected to extract sufficient information for estimating a minimum-error-rate (MER) classifier. Simulations on artificial data sets and real-world data sets are conducted to demonstrate the competitiveness of the suggested method when the number of values in each categorical variable is large and BNCs accurately model the data.

10.
Monocular Vision for Mobile Robot Localization and Autonomous Navigation   Total citations: 5, self-citations: 0, citations by others: 5
This paper presents a new real-time localization system for a mobile robot. We show that autonomous navigation is possible in outdoor situations with the use of a single camera and natural landmarks. To do that, we use a three-step approach. In a learning step, the robot is manually guided on a path and a video sequence is recorded with a front-looking camera. Then a structure from motion algorithm is used to build a 3D map from this learning sequence. Finally, in the navigation step, the robot uses this map to compute its localization in real time and follows the learning path, or a slightly different path if desired. The vision algorithms used for map building and localization are first detailed. Then a large part of the paper is dedicated to the experimental evaluation of the accuracy and robustness of our algorithms, based on experimental data collected over two years in various environments.

11.
Hierarchical pixel bar charts   Total citations: 1, self-citations: 0, citations by others: 1
Simple presentation graphics are intuitive and easy to use, but only show highly aggregated data. Bar charts, for example, only show a rather small number of data values, and x-y plots often have a high degree of overlap. Presentation techniques are often chosen depending on the data type considered: bar charts, for example, are used for categorical data and x-y plots are used for numerical data. We propose a combination of traditional bar charts and x-y plots, which allows the visualization of large amounts of data with categorical and numerical dimensions. The categorical data dimensions are used for the partitioning into the bars and the numerical data dimensions are used for the ordering arrangement within the bars. The basic idea is to use the pixels within the bars to present the detailed information of the data records. Our so-called pixel bar charts retain the intuitiveness of traditional bar charts while applying the principle of x-y charts within the bars. In many applications, a natural hierarchy is defined on the categorical data dimensions such as time, region, or product type. In hierarchical pixel bar charts, the hierarchy is exploited to split the bars for selected portions of the hierarchy. Our application to a number of real-world e-business and Web services data sets shows the wide applicability and usefulness of our new idea.
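
The two roles of the data dimensions in this abstract (categorical for partitioning into bars, numerical for ordering within bars) can be shown with a small sketch. It covers only the layout step and omits pixel placement and color mapping; the function `pixel_bar_layout` and the sample records are hypothetical.

```python
from collections import defaultdict

def pixel_bar_layout(records, category_key, order_key):
    """Partition records into bars by a categorical field, then order each bar
    by a numerical field. Returns {category: ordered list of records}."""
    bars = defaultdict(list)
    for record in records:
        bars[record[category_key]].append(record)
    # Bars are sorted by category name; records inside a bar by the numerical value.
    return {
        category: sorted(items, key=lambda r: r[order_key])
        for category, items in sorted(bars.items())
    }

# Hypothetical e-business records.
records = [
    {"product": "printer", "price": 120.0},
    {"product": "laptop", "price": 900.0},
    {"product": "printer", "price": 80.0},
    {"product": "laptop", "price": 650.0},
]
layout = pixel_bar_layout(records, category_key="product", order_key="price")
bar_order = list(layout)
printer_prices = [r["price"] for r in layout["printer"]]
```

Each record in the resulting lists would correspond to one pixel in its bar, so every data record stays visible instead of being aggregated into a single bar height.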

12.
Hierarchical clustering of mixed data based on distance hierarchy   Total citations: 1, self-citations: 0, citations by others: 1
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular, integrating it with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
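
One way to picture a distance hierarchy is a tree of categorical values with weighted links, where the distance between two values is the total weight of the path joining them. The sketch below assumes exactly that reading and handles only leaf-to-leaf distances; the tree, the link weights, and the function names are hypothetical and may differ from the paper's formal definition.

```python
def path_to_root(node, parent):
    """Map each ancestor of node (and node itself) to its distance from node."""
    distances = {node: 0.0}
    total = 0.0
    while node in parent:
        node, weight = parent[node]
        total += weight
        distances[node] = total
    return distances

def hierarchy_distance(a, b, parent):
    """Path length between a and b through their lowest common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # The lowest common ancestor gives the smallest combined path length.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)

# Hypothetical hierarchy: child -> (parent, link weight).
PARENT = {
    "Coke": ("Carbonated", 1.0),
    "Pepsi": ("Carbonated", 1.0),
    "Mocha": ("Coffee", 1.0),
    "Carbonated": ("Drink", 1.0),
    "Coffee": ("Drink", 1.0),
}

d_close = hierarchy_distance("Coke", "Pepsi", PARENT)
d_far = hierarchy_distance("Coke", "Mocha", PARENT)
```

Values that share a near ancestor come out closer than values that only meet at the root, which is the kind of graded similarity between categorical values that simple matching (equal or not equal) cannot express.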

13.
We present a novel approach to estimating depth from single omnidirectional camera images by learning the relationship between visual features and range measurements available during a training phase. Our model not only yields the most likely distance to obstacles in all directions, but also the predictive uncertainties for these estimates. This information can be utilized by a mobile robot to build an occupancy grid map of the environment or to avoid obstacles during exploration, tasks that typically require dedicated proximity sensors such as laser range finders or sonars. We show in this paper how an omnidirectional camera can be used as an alternative to such range sensors. As the learning engine, we apply Gaussian processes, a nonparametric approach to function regression, as well as a recently developed extension for dealing with input-dependent noise. In practical experiments carried out in different indoor environments with a mobile robot equipped with an omnidirectional camera system, we demonstrate that our system is able to estimate range with an accuracy comparable to that of dedicated sensors based on sonar or infrared light.

14.
While it is quite typical to deal with attributes of different data types in the visualization of heterogeneous and multivariate datasets, most existing techniques still focus on the most usual data types such as numerical attributes or strings. In this paper we present a new approach to the interactive visual exploration and analysis of data that contains attributes which are of set type. A set-typed attribute of a data item (like one cell in a table) has a list of n ≥ 0 elements as its value. We present the set'o'gram as a new visualization approach to represent data of set type and to enable interactive visual exploration and analysis. We also demonstrate how this approach is capable of helping in dealing with datasets that have a larger number of dimensions (a dozen or more), especially in the context of categorical data. To illustrate the effectiveness of our approach, we present the interactive visual analysis of a CRM dataset with data from a questionnaire on the education and shopping habits of about 90000 people.

15.
Multiregression is one of the most common approaches used to discover dependency patterns among attributes in a database. Nonadditive set functions have been applied to deal with the interactive predictive attributes involved, and some nonlinear integrals with respect to nonadditive set functions are employed to establish a nonlinear multiregression model describing the relation between the objective attribute and predictive attributes. The values of the nonadditive set function play the role of unknown regression coefficients in the model and are determined by an adaptive genetic algorithm from the data of predictive and objective attributes. Furthermore, such a model is now improved by a new numericalization technique such that the model can accommodate both categorical and continuous numerical attributes. The traditional dummy binary method for dealing with mixed-type data can be regarded as a very special case of our model when there is no interaction among the predictive attributes and the Choquet integral is used. When running the algorithm, to avoid premature convergence during the evolutionary procedure, a technique of maintaining diversity in the population is adopted. A test example shows that the algorithm and the relevant program have a good reversibility for the data. © 2001 John Wiley & Sons, Inc. 16: 949–962 (2001)

16.
We propose a model for a point-referenced spatially correlated ordered categorical response and methodology for inference. Models and methods for spatially correlated continuous response data are widespread, but models for spatially correlated categorical data, and especially ordered multi-category data, are less developed. Bayesian models and methodology have been proposed for the analysis of independent and clustered ordered categorical data, and also for binary and count point-referenced spatial data. We combine and extend these methods to describe a Bayesian model for point-referenced (as opposed to lattice) spatially correlated ordered categorical data. We include simulation results and show that our model offers superior predictive performance as compared to a non-spatial cumulative probit model and a more standard Bayesian generalized linear spatial model. We demonstrate the usefulness of our model in a real-world example to predict ordered categories describing stream health within the state of Maryland.

17.
Removing Multiplicative Noise by Douglas-Rachford Splitting Methods   Total citations: 1, self-citations: 0, citations by others: 1
In this paper, we consider a variational restoration model consisting of the I-divergence as data fitting term and the total variation semi-norm or nonlocal means as regularizer for removing multiplicative Gamma noise. Although the I-divergence is the typical data fitting term when dealing with Poisson noise, we substantiate why it is also appropriate for cleaning Gamma noise. We propose to compute the minimizers of our restoration functionals by applying Douglas-Rachford splitting techniques or, respectively, alternating direction methods of multipliers. For a particular splitting, we present a semi-implicit scheme to solve the involved nonlinear systems of equations and prove its Q-linear convergence. Finally, we demonstrate the performance of our methods by numerical examples.

18.
There is significant interest in the data mining and network management communities about the need to improve existing techniques for clustering multivariate network traffic flow records so that we can quickly infer underlying traffic patterns. In this paper, we investigate the use of clustering techniques to identify interesting traffic patterns from network traffic data in an efficient manner. We develop a framework to deal with mixed type attributes including numerical, categorical, and hierarchical attributes for a one-pass hierarchical clustering algorithm. We demonstrate the improved accuracy and efficiency of our approach in comparison to previous work on clustering network traffic.

19.
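Estimating task execution time is a prerequisite for workflow scheduling in cloud data centers. Existing methods for predicting workflow task execution time lack effective extraction of features from categorical and numerical data; to address this, a prediction method based on multi-dimensional feature fusion is proposed. First, a stacked residual recurrent network with an attention mechanism is constructed to map categorical data from a high-dimensional sparse feature space to a low-dimensional dense feature space, which strengthens the parsing of categorical data and extracts categorical features effectively. Second, the extreme gradient boosting (XGBoost) algorithm is used to discretize and encode numerical data; sparsifying the input vectors of the dense space improves the nonlinear expressive power of the numerical features. On this basis, a multi-dimensional heterogeneous feature fusion strategy is designed that fuses the extracted categorical and numerical features with the original input features of the samples, and a prediction model based on the fused features is built to predict cloud workflow task execution time accurately. Finally, simulation experiments are carried out on a cluster data set from a real cloud data center. The results show that the method achieves higher prediction accuracy than existing baseline algorithms and can be used for big-data-driven prediction of cloud workflow task execution time.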

20.
The fuzzy min–max neural network classifier is a supervised learning method. This classifier takes the hybrid neural networks and fuzzy systems approach. All input variables in the network are required to correspond to continuously valued variables, and this can be a significant constraint in many real-world situations where there are not only quantitative but also categorical data. The usual way of dealing with this type of variable is to replace the categorical values with numerical ones and treat them as if they were continuously valued. But this method implicitly defines a possibly unsuitable metric for the categories. A number of different procedures have been proposed to tackle the problem. In this article, we present a new method. The procedure extends the fuzzy min–max neural network input to categorical variables by introducing new fuzzy sets, a new operation, and a new architecture. This provides for greater flexibility and wider application. The proposed method is then applied to missing data imputation in voting intention polls. The micro data of this type of poll (the set of the respondents’ individual answers to the questions) are especially suited for evaluating the method since they include a large number of numerical and categorical attributes.
