Similar literature
1.
Class Noise vs. Attribute Noise: A Quantitative Study   Cited by: 2 (self-citations: 0, citations by others: 2)
Real-world data is never perfect and often suffers from corruptions (noise) that may affect interpretations of the data, models created from the data, and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy, the time taken to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms have integrated various approaches to enhance their ability to learn from noisy environments, but the existence of noise can still introduce serious negative impacts. A more reasonable solution might be to employ some preprocessing mechanisms to handle noisy instances before a learner is formed. Unfortunately, little research has been conducted to systematically explore the impact of noise, especially from the noise handling point of view. This has made various noise processing techniques less significant, specifically when dealing with noise that is introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of taking any unified theory of noise to evaluate the noise impacts, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on system performance separately. Because class noise has been widely addressed in existing research efforts, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise. Our conclusions can be used to guide interested readers to enhance data quality by designing various noise handling mechanisms.
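
The distinction between the two noise categories can be made concrete by injecting each kind into a clean data set. The sketch below is illustrative only: the abstract does not say how the authors corrupted their data, so the function names, the uniform random replacement scheme, and the `rate` parameter are all assumptions.

```python
import random


def add_class_noise(labels, classes, rate, rng):
    """Class noise: with probability `rate`, replace a label by a different class."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            # Pick uniformly among the other classes (assumed scheme).
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def add_attribute_noise(rows, column, values, rate, rng):
    """Attribute noise: with probability `rate`, overwrite one attribute
    with a value drawn uniformly from `values` (assumed scheme)."""
    noisy = []
    for row in rows:
        row = list(row)  # copy so the clean data stays untouched
        if rng.random() < rate:
            row[column] = rng.choice(values)
        noisy.append(row)
    return noisy
```

Training the same learner on data corrupted at several rates, one attribute at a time, is one way to compare how sensitive accuracy is to noise in each attribute.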

2.
A Study of Web Data Cleansing for the Needs of Information Retrieval   Cited by: 2 (self-citations: 0, citations by others: 2)
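Uneven quality, low credibility, and redundancy in Web data waste a great deal of the storage and computing resources of Web information retrieval tools and directly limit improvements in retrieval performance. Existing Web data cleansing methods are not designed specifically for the needs of Web information retrieval and therefore have considerable shortcomings. Based on an analysis of search users' query behavior, this paper proposes an algorithm that uses query-independent feature analysis and prior-knowledge learning to compute the probability that a page will become a retrieval result page, and cleanses Web data accordingly. Experiments on the standard test platform of the Text REtrieval Conference (TREC) show that the algorithm can remove low-quality pages amounting to more than 45% of the pages in the corpus while retaining nearly 95% of the retrieval result pages, which means that it becomes possible to obtain higher retrieval performance with fewer storage and computing resources.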

3.
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem   Cited by: 29 (self-citations: 0, citations by others: 29)
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent equational theory that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
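
The multi-pass idea in this abstract (sort on a different key in each pass, then merge the per-pass matches by transitive closure) can be sketched as follows. This is a minimal illustration, not the authors' system: the sliding `window`, the `is_match` predicate standing in for the equational theory, and all function names are assumptions.

```python
def sorted_neighborhood_pairs(records, key, window, is_match):
    """One pass: sort record indices by `key`, then compare each record
    only with the next `window - 1` records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group indices 0..n-1 into duplicate clusters using union-find,
    so that a~b and b~c puts a, b, and c in one cluster."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Taking the union of the pair sets from a name-sorted pass and an address-sorted pass before calling `transitive_closure` lets a small, cheap window in each pass recover clusters that neither pass finds on its own.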

4.
This paper describes the results of our study on neural learning to solve classification problems in which the data is unbalanced and noisy. We conducted the study on three different neural network architectures (multi-layered Back Propagation, Radial Basis Function, and Fuzzy ARTMAP) using three different training methods: duplicating minority class examples, the Snowball technique, and multidimensional Gaussian modeling of data noise. Three major issues are addressed: neural learning from unbalanced data examples, neural learning from noisy data, and making intentionally biased decisions. We argue that by properly generating extra training data examples around the noise densities, we can train a neural network that has a stronger capability of generalization and better control of its classification error. In particular, we focus on problems that require a neural network to favor a particular class in its classifications, such as classifying normal (pass) and abnormal (fail) vehicles in an assembly plant. In addition, we present three methods that quantitatively measure the noise level of a given data set. All experiments were conducted using data examples downloaded directly from test sites of an automobile assembly plant. The experimental results showed that the proposed multidimensional Gaussian noise modeling algorithm was very effective in generating extra data examples that can be used to train a neural network to make favorable decisions for the minority class and to have increased generalization capability.

5.
王长宝  李青雯  于化龙 《计算机科学》2017,44(12):221-226, 254
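To address the problems that existing active learning algorithms generally fail and take too long to train when the class distribution of samples is imbalanced, this paper proposes using the extreme learning machine (ELM), which is faster to build, as the base classifier for active learning, and using the weighted ELM algorithm to control balance during the active learning process. The online learning procedure is then derived theoretically, which greatly reduces the time cost of active learning. The resulting hybrid algorithm is named AOW-ELM. Its effectiveness and feasibility are verified on 12 benchmark binary-class imbalanced data sets.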

6.
王强  关毅  王晓龙 《自动化学报》2007,33(8):809-816
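This paper proposes an algorithm for eliminating class noise (ECN) in text classification by using the class attributes of text features. By analyzing the class-indicating information contained in the key features of a text, the algorithm actively predicts the set of classes to which the text to be classified may belong, thereby reducing the number of classifiers involved in the decision, lowering classification latency, and improving classification accuracy. Experiments on Chinese and English test corpora show that the algorithm achieves F-values of 0.76 and 0.93, respectively, with a clear improvement in classifier running efficiency and good overall performance. Further experiments show that the algorithm scales well: combined with a feedback learning strategy, classification performance improves further, with F-values reaching 0.806 and 0.943.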

7.
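A Fast Nearest-Neighbor Prototype Selection Algorithm Considering Local Means and Class Global Information   Cited by: 1 (self-citations: 0, citations by others: 1)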
李娟  王宇平 《自动化学报》2014,40(6):1116-1125
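The condensed nearest neighbor method is a simple nonparametric prototype selection algorithm whose prototype selection is easily disturbed by the order in which samples are read and by abnormal samples. To overcome these problems, a nearest-neighbor prototype selection method based on local means and class global information is proposed. During prototype selection, the method makes full use of the local means of the k same-class and different-class nearest neighbors, within the prototype set, of the sample to be learned, together with class global information; it also sets a prototype-set update strategy so that the prototype set is updated dynamically. The method not only mitigates the influence of reading order and abnormal samples on prototype selection and reduces the size of the prototype set, but also achieves high compression of the data set while maintaining high classification accuracy. Experimental results on image recognition and UCI (University of California Irvine) benchmark data sets show that the proposed algorithm achieves more effective classification performance than the algorithms it is compared with.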

8.
The problem of downscaling the effects of global scale climate variability into predictions of local hydrology has important implications for water resource management. Our research aims to identify predictive relationships that can be used to integrate solar and ocean-atmospheric conditions into forecasts of regional water flows. In recent work we have developed an induction technique called second-order table compression, in which learning can be viewed as a process that transforms a table consisting of training data into a second-order table (which has sets of atomic values as entries) with fewer rows by merging rows in consistency-preserving ways. Here, we apply the second-order table compression technique to generate predictive models of future water inflows of Lake Okeechobee, a primary source of water supply for south Florida. We also describe SORCER, a second-order table compression learning system, and compare its performance with three well-established data mining techniques: neural networks, decision tree learning, and association rule mining. SORCER gives more accurate results, on average, than the other methods, with average accuracy between 49% and 56% in the prediction of inflows discretized into four ranges. We discuss the implications of these results and the practical issues in assessing the results from data mining models to guide decision-making.

9.
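In recent years, data analysis and data publishing based on machine learning have become popular research directions. Compared with traditional data analysis techniques, machine learning has the advantage of accurately analyzing the structure and patterns of big data. However, the privacy and security problems of machine-learning-based data analysis are increasingly prominent, and incidents in which machine learning models leak private information from users' training sets occur frequently: for example, membership inference attacks reveal whether a record was present in the training data, and membership attribute attacks reveal private attribute information of the training set. Differential privacy, a common technique for traditional data privacy protection, is being integrated into machine learning to protect user privacy. However, research at the intersection of privacy and security, machine learning, and attacks on machine learning is still rare. This paper makes the following contributions. First, it surveys the development of differential privacy, including the definitions, properties, and implementation mechanisms of common variants, and illustrates the application scenarios of several implementation mechanisms with examples. In addition, it discusses in detail the recent Rényi differential privacy definition and the moments accountant technique for accumulating differential privacy cost. Second, it summarizes in detail the common privacy threat model definitions in machine learning, concrete forms of privacy attacks, and how well differential privacy resists each of them. Third, taking the discriminative and generative models common in machine learning as examples, it explains how differential privacy is applied to protect machine learning models, including differentially private stochastic gradient perturbation (DP-SGD) and differentially private knowledge transfer (PATE). Finally, it discusses several research directions and open problems for differential privacy mechanisms oriented to machine learning.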
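
As a rough illustration of the gradient perturbation idea behind DP-SGD named in this abstract, the sketch below clips each example's gradient to a fixed L2 norm, sums the clipped gradients, adds Gaussian noise scaled to the clipping norm, and averages. It is a simplified assumption-laden sketch rather than the surveyed paper's formulation: the function name and parameters (`clip_norm`, `noise_multiplier`) are hypothetical, and no privacy accounting (for example a moments accountant) is performed, so it does not by itself yield a stated privacy guarantee.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update on a list of float weights."""
    total = [0.0] * len(weights)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k, g in enumerate(grad):
            total[k] += g * scale
    # Gaussian noise whose standard deviation is tied to the clipping norm.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(t + rng.gauss(0.0, sigma)) / batch for t in total]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

Clipping bounds how much any single training example can move the model, which is what lets noise of a fixed scale mask each individual's contribution.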

10.
Conventional classification algorithms are not well suited for the inherent uncertainty, potential concept drift, volume, and velocity of streaming data. Specialized algorithms are needed to obtain effic...

11.
This paper proposes to apply machine learning techniques to predict students' performance on two real-world educational data-sets. The first data-set is used to predict the response of students with autism while they learn a specific task, whereas the second one is used to predict students' failure at a secondary school. The two data-sets suffer from two major problems that can negatively impact the ability of classification models to predict the correct label: class imbalance and class noise. A series of experiments have been carried out to improve the quality of training data, and hence improve prediction results. In this paper, we propose two noise filter methods to eliminate the noisy instances from the majority class located inside the borderline area. Our methods combine the over-sampling SMOTE technique with the thresholding technique to balance the training data and choose the best boundary between classes. Then we apply a noise detection approach to identify the noisy instances. We have used the two data-sets to assess the efficacy of class-imbalance approaches as well as both proposed methods. Results for different classifiers show that the AUC scores improved significantly when the two proposed methods were combined with existing class-imbalance techniques.

12.
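Learning to rank is currently one of the research hotspots in information retrieval. To avoid the influence of noise in the training set, current learning-to-rank algorithms pay considerable attention to robustness. Previous work has found that the same learning-to-rank method can show very different noise sensitivity on different data sets. Changes in the model are the direct cause of performance degradation, and because the model is learned from the training set, the root cause lies in certain characteristics of the training data. Based on an analysis of concrete learning-to-rank scenarios, this paper concludes that the fundamental factor affecting noise sensitivity is the distribution of document pairs in the training set, and experiments on LETOR3.0 confirm this conclusion.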

13.
Design and Implementation of a Data Structure Visualization Class Library   Cited by: 4 (self-citations: 0, citations by others: 4)
苏莹  吴伟民 《微机发展》2006,16(5):61-64
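The Visual Data Structures Class Library in Java (JVDSCL), developed by our studio, introduces visualization techniques into a data structure class library and thereby visualizes data structures. This paper describes how data structure classes are extended with visualization and gives a basic framework for implementing visual layout algorithms for various data structures. JVDSCL can be used in program debugging and software development to improve software visibility, reusability, and development efficiency.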

14.
Data Mining in Large Databases Using Domain Generalization Graphs   Cited by: 5 (self-citations: 0, citations by others: 5)
Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the domain generalization graphs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results also show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.
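
A single generalization step of the kind described here (replace specific attribute values with a more general concept, then merge tuples that have become identical) can be sketched as below. The function names, the dictionary encoding of one level of a concept hierarchy, and the sample data are hypothetical; this shows only the basic replace-and-merge step, not the Multi-Attribute Generalization algorithm or domain generalization graphs themselves.

```python
from collections import Counter


def generalize(rows, column, hierarchy):
    """Replace each value in `column` by its parent concept in `hierarchy`;
    values with no parent listed are left unchanged."""
    return [
        tuple(hierarchy.get(v, v) if k == column else v for k, v in enumerate(row))
        for row in rows
    ]


def summarize(rows):
    """Merge identical tuples and count how many originals each one covers."""
    return Counter(rows)


# Hypothetical one-level concept hierarchy: city -> province.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
}
```

For example, `summarize(generalize(rows, 0, CITY_TO_PROVINCE))` collapses rows for Vancouver and Victoria that agree on every other attribute into a single British Columbia tuple with a count of 2.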

15.
高鹏  曹先彬 《计算机工程》2008,34(5):166-168
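Chat data in chat rooms is full of noise, which greatly reduces the monitoring efficiency of topic detection. However, the only clue in chat data that can be used directly is the time at which each message was sent, which makes noise filtering a difficult problem in chat room monitoring. This paper proposes a social-network-based method for filtering noise from chat data: it analyzes the temporal relationships in the chat data to infer the social network among chat users, and then identifies and filters out noise according to the user communication patterns implied by that network. Experiments confirm that the method filters noise fairly accurately and improves the accuracy of topic identification.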

16.
张战成  王士同  钟富礼 《自动化学报》2011,37(10):1256-1263
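A collaborative classification algorithm using global and local information, C2M (Collaborative classification machine with local and global information), is proposed. The algorithm uses the covariance of each of the two classes of samples as global direction information to obtain two separating surfaces that carry both global and local information, and combines them by the averaging rule for classifier ensembles to obtain the final optimal decision plane. The algorithm can be solved with two QP (quadratic programming) problems, with time complexity O(2N^3), far lower than the O(N^4) of M4 (Maxi-min margin machine); with a linear kernel, its classification accuracy is higher than that of the support vector machine (SVM), which uses only local information. It is proved theoretically that when the classes overlap heavily, C2M uses global information more effectively than M4, and four indicators are proposed for judging whether global information contributes to classification. Comparative experiments against M4 and SVM on simulated data and standard data sets demonstrate the effectiveness of the algorithm.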

17.
Big data introduces challenges to query answering, from theory to practice. A number of questions arise. What queries are "tractable" on big data? How can we make big data "small" so that it is feasible to find exact query answers? When exact answers are beyond reach in practice, what approximation theory can help us strike a balance between the quality of approximate query answers and the costs of computing such answers? To get sensible query answers in big data, what else must we do in addition to coping with the size of the data? This position paper aims to provide an overview of recent advances in the study of querying big data. We propose approaches to tackling these challenging issues, and identify open problems for future research.

18.
杨超  何静静 《计算机工程》2008,34(11):268-269
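Evaluation data is easily contaminated by noise, which distorts evaluation conclusions. This paper proposes a data-density-based noise cleaning (DNC) algorithm for identifying and filtering noise in evaluation data, and describes a simulation experiment scheme. Experimental results show that the DNC algorithm improves noise identification and filtering. The algorithm is of practical value in the field of data management.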

19.
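Blockchain is tamper-proof and decentralized, and its combination with federated learning has become a hot topic in artificial intelligence. Decentralized federated learning currently suffers performance degradation caused by training data that is not independent and identically distributed (non-IID). To solve this problem, a method for computing model similarity is proposed, and a decentralized federated learning strategy based on this model similarity is designed. The strategy is tested on five federated learning tasks: a CNN model trained on the fashion-mnist data set, an alexnet model on cifar10, a TextRnn model on THUsnews, a Resnet18 model on SVHN, and an LSTM model on sentiment140. Experimental results show that, for decentralized federated learning on non-IID data in the five tasks, the designed strategy improves accuracy by 2.51, 5.16, 17.58, 2.46, and 5.23 percentage points, respectively.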

20.
An Image Denoising Method Based on Wavelet Transform and Data Fusion   Cited by: 3 (self-citations: 0, citations by others: 3)
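An image denoising method based on the wavelet transform and data fusion is proposed. The method applies wavelet decomposition separately to multiple source images of the same original image signal that carry different noise. In the high-frequency domain of the decomposition, the wavelet coefficients are thresholded and then fused, with the important wavelet coefficients selected according to the "majority rule". In the low-frequency domain, the new approximation coefficients are obtained by directly taking a weighted average of the approximation coefficients of the images. The inverse wavelet transform is then applied to the important wavelet coefficients and the approximation coefficients to obtain the fused image. Experimental results show that the method both reduces noise effectively and preserves image detail well.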
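
The pipeline in this abstract (decompose each noisy copy, threshold and fuse the detail coefficients, average the approximation coefficients, invert) can be sketched on 1-D signals. Everything specific below is an assumption made for illustration, since the abstract gives no details: a single-level Haar transform, hard thresholding, equal weights in the average, and a reading of the "majority rule" in which a detail coefficient is kept only if it survives thresholding in more than half of the sources. Signals must have even length.

```python
import math


def haar_forward(signal):
    """Single-level Haar transform: returns (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def fuse_denoise(signals, threshold):
    """Fuse several noisy copies of the same signal."""
    decomposed = [haar_forward(x) for x in signals]
    n = len(signals)
    # Low frequency: equal-weight average of the approximation coefficients.
    approx = [
        sum(a[k] for a, _ in decomposed) / n
        for k in range(len(decomposed[0][0]))
    ]
    # High frequency: hard-threshold, then keep a coefficient only if it
    # survives in a majority of sources (assumed reading of "majority rule").
    detail = []
    for k in range(len(decomposed[0][1])):
        kept = [d[k] for _, d in decomposed if abs(d[k]) > threshold]
        detail.append(sum(kept) / len(kept) if len(kept) * 2 > n else 0.0)
    return haar_inverse(approx, detail)
```

A detail coefficient that is large in only one copy is treated as noise and zeroed, while one that is large in most copies is treated as genuine signal structure and kept.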
