Similar Documents
20 similar documents found (search time: 15 ms)
1.
Learning from imbalanced data occurs frequently in many machine learning applications. One positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data; however, this results in high false positive predictions. Hence, we propose preprocessing training data by partitioning them into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, this technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our techniques to splice site prediction and the protein subcellular localization problem, with significant improvements over previous predictors.
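The preprocessing step described above can be sketched as follows; this is a minimal illustration only, assuming scikit-learn's KMeans, a binary label vector y with 1 marking the rare class, and a fixed cluster count, none of which are specified in the abstract.

```python
# Minimal sketch, not the paper's exact procedure: partition (X, y) into
# clusters so that the local imbalance ratio shrinks. Assumes scikit-learn
# KMeans and binary labels with 1 marking the rare class.
import numpy as np
from sklearn.cluster import KMeans

def cluster_partition(X, y, n_clusters=10, random_state=0):
    labels = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X)
    partitions = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        pos = int(y[idx].sum())      # minority instances in this cluster
        neg = len(idx) - pos         # majority instances in this cluster
        partitions.append((idx, pos, neg))
    return partitions

# A classifier can then be trained per cluster, or only on clusters that
# actually contain minority examples, which lowers the local imbalance.
```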

2.
In cluster analysis, using validity indices to determine the correct number of clusters in a data set is highly susceptible to noisy data, to the separability between clusters, and to the clustering algorithm itself, so the correctness of the chosen cluster number is hard to guarantee. To overcome this problem, this work builds on the data reduction method of reference [1] and applies validity indices to both the original data set and the reduced data set to identify the correct number of clusters. Experiments show that the method increases the separability between clusters and effectively determines the optimal number of clusters in a data set.
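A hedged sketch of the selection procedure: a validity index is evaluated for a range of candidate cluster numbers on both the original and the reduced data. The silhouette coefficient stands in for the unspecified validity index, and a random subsample stands in for the data reduction method of reference [1].

```python
# Hedged sketch: pick the number of clusters whose validity index is best on
# both the original data and a reduced version of it. Silhouette stands in
# for the validity index; random subsampling stands in for the reduction of
# reference [1].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 11), random_state=0):
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, random_state=random_state)
                                  .fit_predict(X)) for k in k_range}
    return max(scores, key=scores.get), scores

def reduce_data(X, frac=0.3, random_state=0):
    rng = np.random.default_rng(random_state)
    return X[rng.choice(len(X), size=int(frac * len(X)), replace=False)]

# Agreement between best_k(X)[0] and best_k(reduce_data(X))[0] increases
# confidence in the selected cluster number.
```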

3.
This paper studies the problem of point cloud simplification by searching for a subset of the original input data set according to a specified data reduction ratio (desired number of points). The unique feature of the proposed approach is that it aims at minimizing the geometric deviation between the input and simplified data sets. The underlying simplification principle is based on clustering of the input data set. The cluster representation essentially partitions the input data set into a fixed number of point clusters and each cluster is represented by a single representative point. The set of the representatives is then considered as the simplified data set and the resulting geometric deviation is evaluated against the input data set on a cluster-by-cluster basis. Due to the fact that the change to a representative selection only affects the configuration of a few neighboring clusters, an efficient scheme is employed to update the overall geometric deviation during the search process. The search involves two interrelated steps. It first focuses on a good layout of the clusters and then on fine tuning the local composition of each cluster. The effectiveness and performance of the proposed approach are validated and illustrated through case studies using synthetic as well as practical data sets.
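The cluster-and-represent principle can be sketched as below. This is an assumption-laden illustration: KMeans supplies the clusters, each representative is simply the input point closest to its cluster centroid, and the deviation is measured point-to-representative; the paper's two-step search and incremental deviation update are not reproduced.

```python
# Hedged sketch: clustering-based point cloud simplification. Each cluster is
# represented by the input point closest to its centroid, and the geometric
# deviation is measured from every point to its cluster representative.
import numpy as np
from sklearn.cluster import KMeans

def simplify_point_cloud(points, n_keep, random_state=0):
    points = np.asarray(points, dtype=float)
    km = KMeans(n_clusters=n_keep, random_state=random_state).fit(points)
    reps = np.empty((n_keep, points.shape[1]))
    for c in range(n_keep):
        members = points[km.labels_ == c]
        d = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        reps[c] = members[np.argmin(d)]          # representative = nearest input point
    deviation = np.linalg.norm(points - reps[km.labels_], axis=1)
    return reps, deviation.max(), deviation.mean()
```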

4.
This paper surveys and classifies dimensionality reduction methods for high-dimensional data sets and their applications. Dimensionality reduction methods fall into two main categories: subset selection and data transformation. Based on mathematical statistics and existing data mining models, several representative methods from each category are presented, their main characteristics and effective applications are analyzed and discussed, and some feasible implementation strategies are given.
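The two categories can be illustrated side by side; the sketch below assumes scikit-learn and uses univariate feature selection and PCA merely as familiar representatives of subset selection and data transformation.

```python
# Hedged illustration of the two families named in the abstract: subset
# selection (keep some original features) versus data transformation
# (project onto new axes). scikit-learn and the iris data are assumptions.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

X_subset = SelectKBest(f_classif, k=2).fit_transform(X, y)    # subset selection
X_transformed = PCA(n_components=2).fit_transform(X)          # data transformation

print(X_subset.shape, X_transformed.shape)  # both (150, 2), different semantics
```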

5.
The present paper deals with formal concept analysis of data with fuzzy attributes. We clarify several points of a new approach of [S.Q. Fan, W.X. Zhang, Variable threshold concept lattice, Inf. Sci., accepted for publication] which is based on using thresholds in concept-forming operators. We show that the extent- and intent-forming operators from [S.Q. Fan, W.X. Zhang, Inf. Sci., accepted for publication] can be defined in terms of basic fuzzy set operations and the original operators as introduced and studied e.g. in [R. Belohlavek, Fuzzy Galois connections, Math. Logic Quarterly 45 (4) (1999) 497-504; R. Belohlavek, Concept lattices and order in fuzzy logic, Ann. Pure Appl. Logic 128 (2004) 277-298; S. Pollandt, Fuzzy Begriffe, Springer-Verlag, Berlin/Heidelberg, 1997]. As a consequence, main properties of the new operators from [S.Q. Fan, W.X. Zhang, Inf. Sci., accepted for publication], including the properties studied in [S.Q. Fan, W.X. Zhang, Inf. Sci., accepted for publication], can be obtained as consequences of the original operators from [R. Belohlavek, 1999; R. Belohlavek, 2004; S. Pollandt, 1997].
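For orientation, the original concept-forming operators from the cited works of Belohlavek and Pollandt are usually written as below, for a fuzzy context I in L^(X×Y) over a complete residuated lattice L with residuum →; per the abstract, the variable-threshold operators can be expressed through these together with basic fuzzy set operations.

```latex
% Standard fuzzy concept-forming operators (as in Belohlavek 1999/2004 and
% Pollandt 1997); the variable-threshold operators of Fan & Zhang are, per
% the abstract, expressible via these plus basic fuzzy set operations.
A^{\uparrow}(y) = \bigwedge_{x \in X} \bigl( A(x) \rightarrow I(x,y) \bigr),
\qquad
B^{\downarrow}(x) = \bigwedge_{y \in Y} \bigl( B(y) \rightarrow I(x,y) \bigr)
```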

6.
This paper constructs a real-time multimedia data mining model and proposes a new mechanism for mining raw video data. It mainly uses hierarchical vector distances for dynamically controllable sequence analysis and segmentation and for intra-segment feature extraction, and uses particle swarm K-means for inter-segment clustering, which to a certain extent addresses the special requirements of multimedia data mining. The individual parts of the mining model, combined with the proposed techniques, essentially meet the requirements for processing raw video data in real time.

7.
8.
Data reduction is an important topic in data mining, covering data compression, data adjustment, and feature extraction, but existing data reduction methods focus mainly on reducing features or dimensions, while reduction methods targeting the number of samples are usually developed for specific data sets and lack generality. Based on the general characteristics of the data distribution in a data set, a new measure based on the opening angle is defined. This measure distinguishes the essential difference between the distributions of core objects and boundary objects in a data set, enabling data compression centered on the core objects. Data reduction and tests on 20 typical sample sets with different characteristics from the UCI public test repository show that the reduction effectively extracts the core objects of a data set; clustering the data sets before and after reduction with the classical K-means algorithm shows that the clustering accuracy on the reduced data sets is clearly higher than on the original ones.
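The abstract does not give the exact definition of the opening-angle measure, so the sketch below is only one plausible proxy: for each point, the directional concentration of its k nearest neighbors is computed, on the assumption that core objects are surrounded by neighbors while boundary objects see them mostly on one side.

```python
# Hedged proxy for an "opening angle"-style core/boundary measure (the
# paper's definition is not given in the abstract): the norm of the mean unit
# direction vector towards the k nearest neighbours. Core objects are
# surrounded, so their score is near 0; boundary objects see neighbours mostly
# on one side, so their score is near 1.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_score(points, k=8):
    points = np.asarray(points, dtype=float)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(points).kneighbors(points)
    scores = np.empty(len(points))
    for i, neigh in enumerate(idx):
        dirs = points[neigh[1:]] - points[i]                 # skip the point itself
        dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
        scores[i] = np.linalg.norm(dirs.mean(axis=0))
    return scores

# Keeping the points with low scores retains the core objects and discards the
# boundary, which is the compression step described in the abstract.
```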

9.
Crisp input and output data are fundamentally indispensable in traditional data envelopment analysis (DEA). However, the input and output data in real-world problems are often imprecise or ambiguous. Some researchers have proposed interval DEA (IDEA) and fuzzy DEA (FDEA) to deal with imprecise and ambiguous data in DEA. Nevertheless, many real-life problems use linguistic data that cannot be used as interval data, and a large number of input variables in fuzzy logic could result in a significant number of rules that are needed to specify a dynamic model. In this paper, we propose an adaptation of the standard DEA under conditions of uncertainty. The proposed approach is based on a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set with additional constraints based on the worst case solution with respect to the uncertainty set. Our robust DEA (RDEA) model seeks to maximize efficiency (similar to standard DEA) but under the assumption of a worst case efficiency defined by the uncertainty set and its supporting constraint. A Monte-Carlo simulation is used to compute the conformity of the rankings in the RDEA model. The contribution of this paper is fourfold: (1) we consider ambiguous, uncertain and imprecise input and output data in DEA; (2) we address the gap in the imprecise DEA literature for problems not suitable or difficult to model with interval or fuzzy representations; (3) we propose a robust optimization model in which the input and output parameters are constrained to be within an uncertainty set with additional constraints based on the worst case solution with respect to the uncertainty set; and (4) we use Monte-Carlo simulation to specify a range of Gamma in which the rankings of the DMUs occur with high probability.
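For context, a minimal sketch of the standard (crisp) CCR multiplier model that RDEA extends, solved per DMU with scipy's linprog; the uncertainty-set constraints and the Gamma-controlled robustness of the RDEA model are not reproduced here.

```python
# Hedged sketch of the standard (crisp) CCR multiplier model only; the robust
# uncertainty-set constraints of the RDEA model are not included.
# X is (n_dmus, n_inputs), Y is (n_dmus, n_outputs); scipy is assumed.
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, o):
    n, m = X.shape
    _, s = Y.shape
    # Variables z = [u_1..u_s, v_1..v_m]; maximise u.y_o  ==  minimise -u.y_o.
    c = np.concatenate([-Y[o], np.zeros(m)])
    # u.y_j - v.x_j <= 0 for every DMU j.
    A_ub = np.hstack([Y, -X])
    b_ub = np.zeros(n)
    # Normalisation v.x_o = 1.
    A_eq = np.concatenate([np.zeros(s), X[o]]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m), method="highs")
    return -res.fun   # efficiency score in (0, 1]

# Example: efficiencies = [ccr_efficiency(X, Y, o) for o in range(len(X))]
```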

10.
Conceptual design plays an important role in the development of new products and the redesign of existing products. The morphological matrix is a popular tool for conceptual design. Although morphological-matrix based conceptual design approaches are effective for generating conceptual schemes, quantitative evaluation of each function solution principle is seldom considered, making it difficult to identify the optimal conceptual design by combining these function solution principles. In addition, the uncertainties due to the subjective evaluations from engineers and customers in the early design stage are not considered in these morphological-matrix based conceptual design approaches. To solve these problems, a systematic decision making approach is developed in this research for product conceptual design based on a fuzzy morphological matrix, to quantitatively evaluate function solution principles using the knowledge and preferences of engineers and customers with subjective uncertainties. In this research, the morphological matrix is quantified by associating the properties of function solution principles with information on customer preferences and product failures. Customer preferences for different function solution principles are obtained from multiple customers using fuzzy pairwise comparison (FPC). The fuzzy customer preference degree of each solution principle is then calculated by the fuzzy logarithmic least square method (FLLSM). In addition, the product failure data are used to improve product reliability through fuzzy failure mode effects analysis (FMEA). Unlike traditional FMEA, the causality relationships among failure modes of solution principles are analyzed to use failure information more effectively by constructing a directed failure causality relationship diagram (DFCRD). A fuzzy multi-objective optimization model is also developed to solve the conceptual design problem. The effectiveness of this new approach is demonstrated using a real-world application for the conceptual design of a horizontal directional drilling machine (HDDM).
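The FLLSM step can be illustrated, in hedged form, by its crisp counterpart: the logarithmic least squares (geometric-mean) method derives priority weights as the normalized row geometric means of a pairwise comparison matrix; the fuzzy version would apply this componentwise to fuzzy judgements.

```python
# Hedged illustration of the crisp logarithmic least squares step that FLLSM
# generalises: priority weights are the normalised row geometric means of a
# pairwise comparison matrix. The example judgements are hypothetical; the
# fuzzy version would repeat this componentwise on fuzzy numbers (assumption).
import numpy as np

def llsm_weights(pairwise):
    pairwise = np.asarray(pairwise, dtype=float)
    gm = np.exp(np.log(pairwise).mean(axis=1))   # row geometric means
    return gm / gm.sum()

# Example: three solution principles compared by one customer.
P = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
print(llsm_weights(P))
```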

11.
12.
Vertices Principal Component Analysis (V-PCA) and Centers Principal Component Analysis (C-PCA) generalize Principal Component Analysis (PCA) in order to summarize interval valued data. Neural Network Principal Component Analysis (NN-PCA) represents an extension of PCA for fuzzy interval data. However, the first two methods can also be used for analyzing fuzzy interval data, although they then ignore the spread information. In the literature, the V-PCA method is usually considered computationally cumbersome because it requires the transformation of the interval valued data matrix into a single valued data matrix, the number of rows of which depends exponentially on the number of variables and linearly on the number of observation units. However, it has been shown that this problem can be overcome by considering the cross-products matrix, which is easy to compute. A review of C-PCA and V-PCA (which hence also includes the computational short-cut to V-PCA) and NN-PCA is provided. Furthermore, a comparison is given of the three methods by means of a simulation study and by an application to an empirical data set. In the simulation study, fuzzy interval data are generated according to various models, and it is reported in which conditions each method performs best.
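A hedged sketch of the C-PCA idea only: ordinary PCA applied to the matrix of interval midpoints, with the bounds supplied as two equally shaped arrays; V-PCA and NN-PCA are not reproduced.

```python
# Hedged sketch of Centers PCA (C-PCA) for interval-valued data: ordinary PCA
# is applied to the matrix of interval midpoints. Lower/upper bounds are given
# as two equally shaped arrays; V-PCA and NN-PCA are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA

def centers_pca(lower, upper, n_components=2):
    centers = (np.asarray(lower, float) + np.asarray(upper, float)) / 2.0
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(centers)
    return scores, pca.components_, pca.explained_variance_ratio_

# The interval (or fuzzy) spreads can afterwards be projected onto the same
# components to recover interval-valued scores, which is where the three
# methods start to differ.
```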

13.
In pattern classification and recognition for large-scale complex data, the vast majority of classifiers run into the thorny problem of the curse of dimensionality. Before classifying high-dimensional data, nonlinear dimensionality reduction based on supervised manifold learning can provide an effective solution. Multinomial logistic regression is used for classification and prediction, combined with unsupervised manifold learning based on nonlinear dimensionality reduction to handle classification of both image and non-image data, forming a new classification and recognition method. Extensive experimental tests and comparative analyses verify the superiority of the proposed method.
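A minimal sketch of the pipeline described above, assuming scikit-learn; Isomap is used here purely as a stand-in for the unspecified manifold learning method.

```python
# Hedged sketch: manifold-learning dimensionality reduction followed by
# multinomial logistic regression. Isomap is a stand-in for the unspecified
# manifold learning method; the digits data and parameters are assumptions.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    Isomap(n_neighbors=10, n_components=20),   # nonlinear dimensionality reduction
    LogisticRegression(max_iter=1000),         # multinomial logistic regression
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```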

14.

15.
As user sessions accumulate, the test suites of user-session-based Web testing methods grow increasingly large. To address the resulting challenge of testing cost, concept analysis is used to cluster test cases and build a hierarchical concept network, and, combined with a greedy algorithm, three reduction methods for Web application test suites are proposed. Dynamic test suite reduction is extended, and a dynamic reduction method that removes obsolete user sessions is proposed. Finally, experimental results demonstrate the effectiveness of the approach in terms of statement coverage and fault detection.
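The greedy reduction step can be sketched as follows; the concept-lattice clustering itself is not reproduced, and the session/statement data are hypothetical.

```python
# Hedged sketch of the greedy coverage-based reduction step: repeatedly keep
# the user session that covers the most still-uncovered statements. The
# concept analysis clustering is not reproduced; the sessions are hypothetical.
def greedy_reduce(suite):
    """suite: dict mapping session id -> set of covered statement ids."""
    remaining = set().union(*suite.values())
    reduced = []
    while remaining:
        best = max(suite, key=lambda s: len(suite[s] & remaining))
        if not suite[best] & remaining:
            break
        reduced.append(best)
        remaining -= suite[best]
    return reduced

sessions = {
    "s1": {1, 2, 3},
    "s2": {3, 4},
    "s3": {1, 2, 3, 4, 5},
    "s4": {5, 6},
}
print(greedy_reduce(sessions))   # ['s3', 's4'] keeps full statement coverage
```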

16.
A new scheme, incorporating dimensionality reduction and clustering, suitable for classification of a large volume of remotely sensed data using a small amount of memory is proposed. The scheme involves transforming the data from multidimensional n-space to a 3-dimensional primary color space of blue, green and red coordinates. The dimensionality reduction is followed by data reduction, which involves assigning 3-dimensional samples to a 2-dimensional array. Finally, a multi-stage ISODATA technique incorporating a novel seedpoint picking method is used to obtain the desired number of clusters.

The storage requirements are reduced to a low value by making five passes through the data and storing necessary information during each pass. The first three passes are used to find the minimum and maximum values of some of the variables. The data reduction is done and a classification table is formed during the fourth pass. The classification map is obtained during the fifth pass. The computer memory required is about 2K machine words.

The efficacy of the algorithm is justified by simulation studies using multispectral LANDSAT data.


17.
The theory of concept lattices is an efficient tool for knowledge representation and knowledge discovery, and is applied to many fields successfully. One focus of knowledge discovery is knowledge reduction. Based on the reduction theory of classical formal context, this paper proposes the definition of decision formal context and its reduction theory, which extends the reduction theory of concept lattices. In this paper, strong consistence and weak consistence of decision formal context are defined respectively. For strongly consistent decision formal context, the judgment theorems of consistent sets are examined, and approaches to reduction are given. For weakly consistent decision formal context, implication mapping is defined, and its reduction is studied. Finally, the relation between reducts of weakly consistent decision formal context and reducts of implication mapping is discussed.

18.
A missing data imputation method based on biclustering
To deal with missing data in real-world data sets, a new biclustering-based imputation method is proposed. Exploiting the property that the smaller the mean squared residue within a bicluster, the more similar the data within it, the algorithm converts the imputation problem into the problem of minimizing the mean squared residue of a specific bicluster, and thereby predicts the missing elements of the data set. The minimization problem for a bicluster containing missing data is then solved using the idea of finding the minimum of a quadratic function, together with a mathematical analysis and proof. Finally, simulation experiments on UCI data sets show that the proposed algorithm achieves high imputation accuracy.
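A hedged sketch of the imputation idea: given an already-found bicluster, the missing entry is set to the value minimizing the bicluster's mean squared residue, which is a quadratic function of that entry; a numerical scalar minimizer stands in for the paper's closed-form derivation.

```python
# Hedged sketch: within an already-found bicluster, a missing entry is set to
# the value minimising the bicluster's mean squared residue (MSR), which is
# quadratic in that entry. A numerical scalar minimiser stands in for the
# paper's closed-form solution; the example bicluster is hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

def mean_squared_residue(B):
    row_means = B.mean(axis=1, keepdims=True)
    col_means = B.mean(axis=0, keepdims=True)
    return ((B - row_means - col_means + B.mean()) ** 2).mean()

def impute(bicluster, i, j):
    """Fill the missing entry (i, j) by minimising the MSR over its value."""
    def msr_of(x):
        B = bicluster.copy()
        B[i, j] = x
        return mean_squared_residue(B)
    return minimize_scalar(msr_of).x

B = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [3.0, 4.0, np.nan]])
B[2, 2] = impute(B, 2, 2)
print(B[2, 2])   # close to 5.0, the value consistent with the additive pattern
```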

19.
It is well known that processing big graph data can be costly on the Cloud. Processing big graph data introduces complex and multiple iterations that raise challenges such as parallel memory bottlenecks, deadlocks, and inefficiency. To tackle these challenges, we propose a novel technique for effectively processing big graph data on the Cloud. Specifically, the big data will be compressed with its spatiotemporal features on the Cloud. By exploring spatial data correlation, we partition a graph data set into clusters. In a cluster, the workload can be shared by inference based on time series similarity. By exploiting temporal correlation, in each time series or a single graph edge, temporal data compression is conducted. A novel data driven scheduling is also developed for data processing optimisation. The experimental results demonstrate that the spatiotemporal compression and scheduling achieve significant performance gains in terms of data size and data fidelity loss.

20.
To reduce the high dimensionality required for training feature vectors in speaker identification, we propose an efficient GMM based on local PCA with fuzzy clustering. The proposed method first partitions the data space into several disjoint clusters by fuzzy clustering, and then performs PCA using the fuzzy covariance matrix on each cluster. Finally, the GMM for the speaker is obtained from the transformed feature vectors with reduced dimension in each cluster. Compared to the conventional GMM with a diagonal covariance matrix, the proposed method gives faster results with less storage while maintaining the same performance.
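A hedged sketch of the pipeline: fuzzy c-means partitions the feature vectors, PCA on each cluster's fuzzy covariance matrix reduces the dimension locally, and one diagonal Gaussian per cluster forms the mixture; the fuzzifier, cluster count, and target dimension are assumptions, and a production speaker-ID system would refine the mixture with EM.

```python
# Hedged sketch: fuzzy c-means clustering, PCA on each cluster's fuzzy
# covariance matrix, and a diagonal Gaussian per cluster on the reduced
# vectors. The fuzzifier m, cluster count, and target dimension are
# assumptions, not values from the paper.
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))                 # memberships (n, c)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        ratio = d[:, :, None] / d[:, None, :]                  # d_ik / d_ij
        U = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return U, centers

def local_pca_gmm(X, c=4, n_components=8, m=2.0):
    X = np.asarray(X, dtype=float)
    U, centers = fuzzy_cmeans(X, c, m)
    components = []
    for k in range(c):
        w = U[:, k] ** m
        diff = X - centers[k]
        cov = (w[:, None] * diff).T @ diff / w.sum()           # fuzzy covariance
        evals, evecs = np.linalg.eigh(cov)
        P = evecs[:, ::-1][:, :n_components]                   # top principal axes
        Z = diff @ P                                           # reduced features
        mu = np.average(Z, axis=0, weights=U[:, k])
        var = np.average((Z - mu) ** 2, axis=0, weights=U[:, k])
        components.append({"weight": U[:, k].sum() / len(X),
                           "projection": P, "mean": mu, "var": var})
    return components
```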
