Similar Literature
20 similar documents retrieved.
1.
Exploratory data analysis methods are essential for gaining insight into data. Identifying the most important variables and detecting quasi-homogeneous groups of data are problems of interest in this context. Solving such problems is a difficult task, mainly due to the unsupervised nature of the underlying learning process. Unsupervised feature selection and unsupervised clustering can be successfully approached as optimization problems by means of global optimization heuristics if an appropriate objective function is considered. This paper introduces an objective function capable of efficiently guiding the search for significant features and, simultaneously, for the corresponding optimal partitions. Experiments conducted on complex synthetic data suggest that the proposed function is unbiased with respect to both the number of clusters and the number of features.
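The abstract does not reproduce the objective function itself. As a rough illustration of the setup it describes (scoring candidate pairs of feature subset and partition with one criterion), here is a minimal sketch in which a generic scatter-ratio objective and plain random search stand in for the paper's actual function and global heuristic:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd k-means; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def objective(X, labels, centers):
    # Between-cluster scatter over within-cluster scatter: a simple
    # stand-in for the paper's unbiased criterion (assumption).
    overall = X.mean(0)
    within = sum(((X[labels == j] - c) ** 2).sum()
                 for j, c in enumerate(centers))
    between = sum((labels == j).sum() * ((c - overall) ** 2).sum()
                  for j, c in enumerate(centers))
    return between / (within + 1e-9)

def search(X, k_max=5, trials=100, seed=1):
    """Random search over (feature subset, number of clusters)."""
    rng = np.random.default_rng(seed)
    best = (-np.inf, None, None)
    for _ in range(trials):
        mask = rng.random(X.shape[1]) < 0.5     # random feature subset
        if not mask.any():
            continue
        k = int(rng.integers(2, k_max + 1))     # random cluster count
        labels, centers = kmeans(X[:, mask], k)
        score = objective(X[:, mask], labels, centers)
        if score > best[0]:
            best = (score, mask, k)
    return best
```

A real implementation would replace random search with a global heuristic (e.g. evolutionary search) and the scatter ratio with the paper's bias-corrected criterion.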

2.
Feature selection (attribute reduction) from large-scale incomplete data is a challenging problem in areas such as pattern recognition, machine learning and data mining. In rough set theory, feature selection from incomplete data aims to retain the discriminatory power of the original features. Many feature selection algorithms have been proposed to address this issue; however, they are often computationally expensive. To overcome this shortcoming, this paper introduces a theoretical framework based on rough set theory, called the positive approximation, which can be used to accelerate heuristic feature selection from incomplete data. As an application of the proposed accelerator, a general feature selection algorithm is designed. By integrating the accelerator into heuristic algorithms, we obtain several modified representative heuristic feature selection algorithms in rough set theory. Experiments show that these modified algorithms outperform their original counterparts, and that the improvement becomes more pronounced on larger data sets.
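The underlying heuristic that such an accelerator speeds up is typically a greedy forward search over the rough set dependency degree. Below is a minimal sketch of that baseline on a complete, toy decision table; the positive-approximation accelerator itself, which progressively shrinks the universe as objects enter the positive region, is deliberately not implemented here:

```python
from collections import defaultdict

def dependency(data, decisions, attrs):
    """gamma_B(D): fraction of objects whose B-indiscernibility
    class is decision-consistent (the positive region)."""
    groups = defaultdict(list)
    for row, d in zip(data, decisions):
        groups[tuple(row[a] for a in attrs)].append(d)
    pos = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return pos / len(data)

def forward_reduct(data, decisions):
    """Greedily add the attribute that raises dependency most,
    until the full-attribute dependency degree is reached."""
    n_attr = len(data[0])
    full = dependency(data, decisions, range(n_attr))
    selected = []
    while dependency(data, decisions, selected) < full:
        best = max((a for a in range(n_attr) if a not in selected),
                   key=lambda a: dependency(data, decisions, selected + [a]))
        selected.append(best)
    return selected
```

Each call to `dependency` rescans the whole universe; the accelerator's contribution is to make these repeated scans cheaper.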

3.
The imbalanced data problem has been recognized for more than two decades as one of the most important and challenging problems in machine learning: scarce information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has shown that certain factors, closely related to the distribution of data over the decision classes, further increase the problem's complexity. Despite the numerous methods that have been proposed, the flexibility of existing solutions needs further improvement. We therefore offer a novel rough-granular computing approach (RGA, for short) to address these issues. New synthetic examples are generated only in specific regions of the feature space; this selective oversampling reduces the number of misclassified minority class examples. A strategy suited to a given problem is obtained by forming information granules and analyzing their degrees of inclusion in the minority class, and potential inconsistencies are eliminated by an editing phase based on a similarity relation. The most significant algorithm parameters (the number of nearest neighbours, the complexity threshold, the distance threshold and the cardinality redundancy) are tuned in an iterative process, and each data model is built with different parameter values. Results of an experimental study on datasets from the UCI repository show that the proposed method of inducing neighbourhoods of examples is crucial to the proper creation of synthetic positive instances, that the algorithm outperforms related methods on most of the tested datasets, and that a set of valid parameters for the RGA technique can be established.
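A hedged sketch of the selective-oversampling idea: synthetic minority examples are interpolated only around minority points whose neighbourhood is sufficiently included in the minority class. The k-NN ratio used below is an assumed stand-in for the paper's granule inclusion degree, and all parameter names are illustrative:

```python
import numpy as np

def selective_oversample(X_min, X_maj, k=5, inclusion=0.5, n_new=20, seed=0):
    """Generate synthetic minority points, but only around minority
    examples whose k-NN neighbourhood (a crude information granule)
    is sufficiently included in the minority class."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < len(X_min)
    seeds = []
    for idx, x in enumerate(X_min):
        d = ((X_all - x) ** 2).sum(axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        if is_min[nn].mean() >= inclusion:   # inclusion-degree check
            seeds.append(idx)
    if not seeds:
        return np.empty((0, X_min.shape[1]))
    out = []
    for _ in range(n_new):                   # interpolate from safe seeds
        i, j = rng.choice(seeds), rng.integers(len(X_min))
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

The editing phase and iterative parameter tuning described in the abstract are additional steps on top of this core generation rule.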

4.
A feature selection method based on granular computing
Starting from the partition model of granular computing, this paper redefines the reduct of a consistent decision table and presents a new attribute reduction algorithm based on granular computing. Using information entropy as the heuristic, the algorithm incrementally adds attributes to form a reduct of the condition attributes relative to the decision attribute, and then deletes all unnecessary attributes from it to obtain a minimal reduct. The algorithm effectively reduces the time complexity of computing attribute reducts and can be used for feature selection on larger data sets. Experiments on five public gene expression data sets demonstrate that the algorithm finds feature subsets with high discriminative power.
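A minimal sketch of the add-then-delete scheme the abstract describes, using conditional entropy H(D|B) as the heuristic; this is a generic reconstruction, not the paper's exact algorithm:

```python
from collections import Counter
from math import log2

def cond_entropy(rows, dec, attrs):
    """H(D | B): conditional entropy of the decision given attributes B."""
    joint = Counter((tuple(r[a] for a in attrs), d) for r, d in zip(rows, dec))
    marg = Counter(tuple(r[a] for a in attrs) for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / marg[key]) for (key, d), c in joint.items())

def entropy_reduct(rows, dec):
    n_attr = len(rows[0])
    target = cond_entropy(rows, dec, range(n_attr))
    red = []
    # add phase: greedily pick the attribute that lowers H(D|B) most
    while cond_entropy(rows, dec, red) > target + 1e-12:
        red.append(min((a for a in range(n_attr) if a not in red),
                       key=lambda a: cond_entropy(rows, dec, red + [a])))
    # delete phase: drop attributes that are not actually needed
    for a in red[:]:
        rest = [b for b in red if b != a]
        if cond_entropy(rows, dec, rest) <= target + 1e-12:
            red = rest
    return sorted(red)
```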

5.
Tabu search for attribute reduction in rough set theory
In this paper, we consider a memory-based tabu search heuristic for the attribute reduction problem in rough set theory. The proposed method, called tabu search attribute reduction (TSAR), is a high-level tabu search with long-term memory; in addition to the tabu neighborhood search methodology, it invokes diversification and intensification schemes. TSAR shows promising and competitive performance compared with other computational intelligence tools in terms of solution quality, and a superior performance in saving computational cost.
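A toy illustration of tabu search for attribute reduction: bit-vector solutions, a cost that prefers small subsets preserving the dependency degree, and a short-term tabu list. TSAR's long-term memory, diversification and intensification are omitted, so this is only the basic neighbourhood-search layer:

```python
from collections import defaultdict, deque

def gamma(rows, dec, subset):
    """Dependency degree of the decision on the chosen attributes."""
    key = lambda r: tuple(v for v, s in zip(r, subset) if s)
    groups = defaultdict(set)
    for r, d in zip(rows, dec):
        groups[key(r)].add(d)
    return sum(1 for r in rows if len(groups[key(r)]) == 1) / len(rows)

def tabu_reduct(rows, dec, iters=60, tenure=2):
    n = len(rows[0])
    full = gamma(rows, dec, [1] * n)
    def cost(s):  # subset size if dependency preserved, else heavy penalty
        g = gamma(rows, dec, s)
        return sum(s) if g >= full else n + (full - g) * n
    cur, best = [1] * n, [1] * n
    tabu = deque(maxlen=tenure)
    for _ in range(iters):
        moves = [i for i in range(n) if i not in tabu]
        if not moves:
            tabu.popleft()
            continue
        # best non-tabu single-bit flip (neighbourhood search)
        i = min(moves, key=lambda i: cost(cur[:i] + [1 - cur[i]] + cur[i + 1:]))
        cur = cur[:i] + [1 - cur[i]] + cur[i + 1:]
        tabu.append(i)
        if cost(cur) < cost(best):
            best = cur[:]
    return [i for i, b in enumerate(best) if b]
```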

6.
A grey-based rough approximation model for interval data processing
A new rough set model for interval data, named the grey-rough set, is proposed in this paper. Information systems in the real world are quite complicated: most information tables record not only categorical data but also numerical data, including ranges given as interval data. The grey lattice operation in grey system theory is an operation on interval data that modifies endpoints non-arithmetically, which makes it useful for interval data processing. The grey-rough approximation is based on an interval coincidence relation and an interval inclusion relation instead of the equivalence and indiscernibility relations of Pawlak's model. Numerical examples and practical examples from four fields (decision-making, information retrieval, knowledge discovery and kansei engineering) are shown. The advantages of the proposal include: extending the treatable values relative to the classical rough set for non-deterministic information systems; providing a maximum solution and a minimum solution in both the upper and lower approximations; and providing not only mathematical support for SQL but also functions for further extension in the future.
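A simplified, set-style sketch of the two relations the model is built on, with interval inclusion and overlap standing in for the inclusion and coincidence relations, and the lower/upper approximations they induce; the real grey lattice operations are richer than this:

```python
def includes(outer, inner):
    """Interval inclusion: inner is contained in outer, intervals as (lo, hi)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def overlaps(a, b):
    """Intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def grey_rough_approx(objects, query):
    """Lower/upper approximation of interval-valued objects against a
    query interval: lower = intervals certainly inside the query,
    upper = intervals possibly inside (overlapping)."""
    lower = [name for name, iv in objects.items() if includes(query, iv)]
    upper = [name for name, iv in objects.items() if overlaps(iv, query)]
    return lower, upper
```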

7.
Databases usually contain many redundant features; identifying the important ones is the task of feature selection. This paper proposes a heuristic feature selection algorithm based on attribute significance. Using attribute significance as its iteration criterion, the algorithm obtains a minimal reduct of the attribute set.

8.
In this paper, an unsupervised learning-based approach is presented for fusing bracketed exposures into high-quality images, avoiding the need for interim conversion to intermediate high dynamic range (HDR) images. Because an objective quality measure, the colored multi-exposure fusion structural similarity index measure (MEF-SSIMc), is optimized to update the network parameters, unsupervised learning is realized without any ground truth (GT) images. Furthermore, a no-reference gradient fidelity term is added to the loss function to recover and supplement image information in the fused result. As shown in the experiments, the proposed algorithm performs well in terms of structure, texture, and color: it maintains the order of variations in the original image brightness, suppresses edge blurring and halo effects, and produces good visual results with good quantitative evaluation indicators. Our code will be publicly available at https://github.com/cathying-cq/UMEF.

9.
Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: the automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial “balls-and-urns” model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC Learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
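A two-rate toy version of the redundancy calculation Urns formalizes: an extraction seen k times in n draws is scored by a binomial likelihood ratio between an assumed per-draw hit rate for correct labels and one for errors. The rates and prior below are illustrative, not the paper's estimated parameters:

```python
from math import comb

def urns_posterior(k, n, p_correct_draw, p_error_draw, prior=0.5):
    """P(extraction correct | seen k times in n draws), under a
    simplified two-rate version of the balls-and-urns model."""
    def binom(k, n, p):
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    a = prior * binom(k, n, p_correct_draw)        # correct hypothesis
    b = (1 - prior) * binom(k, n, p_error_draw)    # error hypothesis
    return a / (a + b)
```

The qualitative behaviour matches the abstract's point: redundancy raises confidence, but only relative to how often errors also repeat.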

10.
Wei-Zhi Wu, Information Sciences, 2011, 181(18): 3878-3897
Granular computing and acquisition of if-then rules are two basic issues in knowledge representation and data mining. A formal approach to granular computing with multi-scale data measured at different levels of granulations is proposed in this paper. The concept of labelled blocks determined by a surjective function is first introduced. Lower and upper label-block approximations of sets are then defined. Multi-scale granular labelled partitions and multi-scale decision granular labelled partitions as well as their derived rough set approximations are further formulated to analyze hierarchically structured data. Finally, the concept of multi-scale information tables in the context of rough set is proposed and the unravelling of decision rules at different scales in multi-scale decision tables is discussed.

11.
12.
A parallel rough set rule mining method based on meta-information
苏健, 高济. 《计算机科学》, 2003, 30(3): 35-39
1. Introduction. In today's information age, the need to extract useful knowledge from the large volumes of accumulated historical data has made data mining a research hotspot. Rough set theory, proposed by Professor Pawlak and refined by many researchers, has become an important tool for data mining. In large-scale data environments, the speed of a data mining method directly affects the performance of the whole data mining system, so effectively improving that speed is an urgent problem. At the same time, computer networks hold abundant computing resources, and making full use of them is an effective way to speed up data mining methods. To this end, this paper proposes…

13.
An α-decision logic language for granular computing
An α-decision logic language for granular computing is proposed. The language is a special classical predicate logic described by models and satisfiability in the sense of Tarski. Generalized information systems, obtained by replacing the classical single-valued information function with fuzzy subsets of the attribute domains, correspond to models; using the level cut sets of fuzzy set theory, the satisfaction of a formula by an object at a given threshold level is defined inductively. Finally, the paper discusses how the α-decision logic language can be used to describe different granular worlds and to analyze formal concepts and decision rules.

14.
Because their imaging mechanisms differ, multi-source images are fundamentally different, which introduces discrepancies in the fusion process. Drawing on a large body of Chinese and international literature, this paper classifies fusion methods, discusses the fusion process and representative algorithms of each class, and elaborates their key techniques. It also reviews current evaluation metrics and their classification. Finally, considering the factors influencing the key techniques and the state of their development, from the perspectives of data characteristics, time efficiency, information extraction, evaluation, and the general applicability of the methods…

15.
Data fusion in information retrieval has been investigated by many researchers, and a number of data fusion methods have been proposed. However, questions such as why data fusion can increase effectiveness, and under which conditions data fusion methods are favorable, remain poorly resolved. In this paper, we formally describe data fusion in a geometric framework, in which each component result returned by an information retrieval system for a given query is represented as a point in a multi-dimensional space. The Euclidean distance is the measure by which the effectiveness and similarity of search results are judged, which allows all component results and fused results to be explained by geometrical principles. In such a framework, score-based data fusion becomes a deterministic problem. Several interesting features of the centroid-based data fusion method and the linear combination method are discussed. Nevertheless, ranking-based measures are the most popular in retrieval evaluation, so this paper also investigates the relation and correlation between the Euclidean distance and several typical ranking-based measures. We find that a very strong correlation exists between them, which means that the theorems and observations obtained using the Euclidean distance remain valid when ranking-based measures are used. The proposed framework enables a better understanding of score-based data fusion and supports using score-based data fusion methods more precisely and effectively in various ways.
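The centroid method mentioned above is easy to state in this geometric framework: fuse by averaging score vectors, and judge results by Euclidean distance. By convexity of the norm, the centroid is never farther from any reference point than the component results are on average, which is the kind of property the framework makes easy to prove:

```python
import numpy as np

def centroid_fusion(score_vectors):
    """Centroid-based fusion: average the per-document score vectors
    returned by the component retrieval systems."""
    return np.mean(np.asarray(score_vectors, dtype=float), axis=0)

def euclidean(a, b):
    """Euclidean distance between two score vectors."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
```

Here each vector holds one score per document for the same query; an ideal result would be the vector of true relevance values (an illustrative construct, not data from the paper).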

16.
A fast core-computation algorithm for the database-based attribute reduction model
For the attribute reduction model based on database systems, this paper gives a corresponding simplified discernibility matrix and the definition of the corresponding core, and proves that this core is equivalent to the core of the database-based attribute reduction model. On this basis, a new core-computation algorithm is designed, whose time and space complexities are max{O(|C||U/C|^2), O(|C||U|)} and O(|U|), respectively.
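A direct, unaccelerated sketch of core computation for a complete decision table: an attribute belongs to the core exactly when deleting it shrinks the positive region. The paper's algorithm reaches better bounds via the simplified discernibility matrix; this brute-force version is O(|C|^2 |U|) and is shown only to fix the definition:

```python
from collections import defaultdict

def positive_region_size(rows, dec, attrs):
    """Number of objects in decision-consistent indiscernibility classes."""
    groups = defaultdict(set)
    for r, d in zip(rows, dec):
        groups[tuple(r[a] for a in attrs)].add(d)
    return sum(1 for r in rows
               if len(groups[tuple(r[a] for a in attrs)]) == 1)

def core(rows, dec):
    """Attributes whose removal shrinks the positive region."""
    all_attrs = list(range(len(rows[0])))
    full = positive_region_size(rows, dec, all_attrs)
    return [a for a in all_attrs
            if positive_region_size(rows, dec,
                                    [b for b in all_attrs if b != a]) < full]
```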

17.
To address the difficulty of fault line selection in small-current grounding systems, a fault line selection method based on multi-source information fusion is proposed. The method takes the set of possibly faulty lines, the fault feature set of each line, and the value domain specified for each line's fault features as the three elements of the matter-element model used in extension fusion. Using the fault angle as the weight, it effectively fuses the energy-based selection method, which is accurate at small fault angles, with the wavelet-packet selection method, which is accurate at large fault angles, and achieves accurate line selection in combination with reclosing technology. Simulation results verify the effectiveness and rationality of the method.

18.
Classification of objects and background in a complex scene is a challenging early-vision problem, and it is compounded under poor illumination. In this paper, the problem of object and background classification is addressed under unevenly illuminated conditions, where the challenge is to extract the actual boundary in poorly illuminated portions. This has been addressed using the notions of rough sets and granular computing. To take care of differently illuminated portions of the image, an adaptive window-growing approach is employed to partition the image into windows, and the optimum classification threshold for a given window is determined by four proposed granular computing based schemes. Because illumination can vary significantly even within a window, we propose four schemes based on heterogeneous and non-homogeneous granulation: (i) Heterogeneous Granulation based Window Growing (HGWG), (ii) Empirical Non-homogeneous Granulation based Window Growing (ENHWG), (iii) Fuzzy Gradient Non-homogeneous based Window Growing (FNHWG), and (iv) Fuzzy Gradient Non-homogeneous Constrained Neighbourhood based Window Growing (FNHCNGWG). The proposed schemes have been tested on images from the Berkeley image database, specifically on unevenly illuminated images with single and multiple objects, and evaluated on four metrics. The FNHWG and FNHCNGWG schemes have been compared with Otsu, K-means, FCM, PCM, Pal's method, HGWG, ENHWG and Fuzzy Non-homogeneous Neighbourhood based Window Growing (FNHNGWG), and found to be superior to the existing schemes.
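A minimal sketch of per-window thresholding in the same spirit, with fixed square windows and the classical Otsu criterion standing in for adaptive window growing and the granular schemes:

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustive between-class-variance maximisation (Otsu's criterion)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = hist[:t].sum() / total
        w1 = 1 - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = (np.arange(t) * hist[:t]).sum() / hist[:t].sum()
        m1 = (np.arange(t, 256) * hist[t:]).sum() / hist[t:].sum()
        v = w0 * w1 * (m0 - m1) ** 2
        if v > best_var:
            best_var, best_t = v, t
    return best_t

def classify_windows(img, win=32):
    """Object/background mask, thresholding each window independently."""
    out = np.zeros_like(img, dtype=bool)
    for i in range(0, img.shape[0], win):
        for j in range(0, img.shape[1], win):
            w = img[i:i + win, j:j + win]
            out[i:i + win, j:j + win] = w >= otsu_threshold(w)
    return out
```

The paper's contribution is precisely what this sketch omits: growing the windows adaptively and choosing the per-window threshold by granulation rather than by Otsu.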

19.
黄琴, 钱文彬, 王映龙, 吴兵龙. 《智能系统学报》, 2019, 14(5): 929-938
In multi-label learning, feature selection is an effective means of improving classification performance. Multi-label feature selection algorithms are computationally expensive and do not account for the fact that acquiring data in real applications often incurs a cost, so this paper proposes a multi-label feature selection algorithm for cost-sensitive data. The algorithm uses information entropy to analyze the correlation between features and labels, redefines a feature significance criterion based on test cost, and, using the standard deviations of the normally distributed feature significance and feature cost, gives a reasonable threshold selection method; thresholding then removes redundant and irrelevant features, yielding a feature subset with low total cost. Comparative experiments and analysis on multi-label data demonstrate the effectiveness and feasibility of the method.

20.
张锐, 肖如良, 倪友聪, 杜欣. 《计算机应用》, 2017, 37(9): 2684-2688
To address the difficulty of modelling associations between attributes of tabular data during data simulation, an H model characterizing the associations between non-temporal attributes of tabular data is proposed. First, the key attributes of evaluating subjects and evaluated objects are extracted from the data set and a two-level frequency count is performed, yielding four relation pairs over the key attributes. Then, the maximal information coefficient (MIC) of each relation pair is computed to assess its correlation, and a stretched exponential (SE) distribution is fitted to each pair. Finally, given the data sizes of the evaluating subjects and evaluated objects, the activity of evaluating subjects and the popularity of evaluated objects are computed from the fitted relations, and the association is established by equating total activity with total popularity, giving the H model of non-temporal attribute association. Experimental results show that the H model effectively characterizes the associations between non-temporal attributes in real data sets.
