Similar Literature
20 similar documents found.
1.
Mining class outliers: concepts, algorithms and applications in CRM   Total citations: 4 (self: 0, others: 4)
Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detecting such outliers is important for many applications and has recently attracted much attention from the data mining research community. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem: given a set of observations with class labels, find those that arouse suspicion, taking into account the class labels. By generalizing two pioneering contributions [Proc WAIM02 (2002); Proc SSTD03] in this field, we develop the notion of a class outlier and propose practical solutions by extending existing outlier detection algorithms to this case. Furthermore, its potential applications in CRM (customer relationship management) are also discussed. Finally, experiments on real datasets show that our method can find interesting outliers and is of practical use.
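For illustration, a minimal sketch of class-outlier scoring in the spirit of this abstract — not the paper's actual definition: an object is suspicious when it lies much farther from its own class than from the data overall. The function name, `k`, and the ratio heuristic are all assumptions.

```python
import numpy as np

def class_outlier_scores(X, y, k=5):
    """Heuristic class-outlier score; higher means more suspicious (assumed)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        if same.size == 0:               # lone member of its class
            scores[i] = np.inf
            continue
        others = np.where(np.arange(n) != i)[0]
        d_same = np.sort(dist[i, same])[:k].mean()    # distance to own class
        d_all = np.sort(dist[i, others])[:k].mean()   # distance to any object
        scores[i] = d_same / (d_all + 1e-12)          # >> 1 suggests a class outlier
    return scores
```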

2.
For some multimedia applications, it has been found that domain objects cannot be represented as feature vectors in a multidimensional space. Instead, pair-wise distances between data objects are the only input. To support content-based retrieval, one approach maps each object to a k-dimensional (k-d) point and tries to preserve the distances among the points. Then, existing spatial access index methods such as the R-trees and KD-trees can support fast searching on the resulting k-d points. However, information loss is inevitable with such an approach, since the distances between data objects can only be preserved to a certain extent. Here we investigate the use of a distance-based indexing method. In particular, we apply the vantage point tree (vp-tree) method. Two important problems for the vp-tree method warrant further investigation: the n-nearest neighbors search and the updating mechanisms. We study an n-nearest neighbors search algorithm for the vp-tree, which is shown by experiments to scale up well with the size of the dataset and the desired number of nearest neighbors, n. Experiments also show that searching in the vp-tree is more efficient than in the R*-tree and the M-tree. Next, we propose solutions to the update problem for the vp-tree, and show by experiments that the algorithms are efficient and effective. Finally, we investigate the problem of selecting the vantage point, propose a few alternative methods, and study their impact on the number of distance computations. Received June 9, 1998 / Accepted January 31, 2000
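A compact vantage-point tree sketch matching the structure this abstract describes (random vantage point, median-distance split, triangle-inequality pruning); the paper's n-NN search and update machinery is omitted:

```python
import math
import random

class VPNode:
    def __init__(self, point, mu, inside, outside):
        self.point, self.mu, self.inside, self.outside = point, mu, inside, outside

def build_vp(points, dist):
    if not points:
        return None
    i = random.randrange(len(points))
    vp, rest = points[i], points[:i] + points[i + 1:]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    ds = sorted(dist(vp, p) for p in rest)
    mu = ds[len(ds) // 2]                        # median split radius
    inside = [p for p in rest if dist(vp, p) <= mu]
    outside = [p for p in rest if dist(vp, p) > mu]
    return VPNode(vp, mu, build_vp(inside, dist), build_vp(outside, dist))

def nearest(node, q, dist, best=(None, float('inf'))):
    if node is None:
        return best
    d = dist(q, node.point)
    if d < best[1]:
        best = (node.point, d)
    near, far = ((node.inside, node.outside) if d <= node.mu
                 else (node.outside, node.inside))
    best = nearest(near, q, dist, best)
    if abs(d - node.mu) <= best[1]:              # far half may still hold the NN
        best = nearest(far, q, dist, best)
    return best

# usage with a plain Euclidean metric on 2-D tuples:
root = build_vp([(0, 0), (1, 1), (5, 5), (6, 5)], math.dist)
print(nearest(root, (5.2, 5.1), math.dist))      # -> ((5, 5), ~0.22)
```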

3.
With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms. Received April 28, 2000 / Accepted July 11, 2000
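A toy, dictionary-based stand-in for a pruned count-suffix tree, just to make the pruning and lookup steps concrete — it is not the MO estimator, and `prune_below`/`max_len` are invented parameters:

```python
from collections import defaultdict

def build_pst(strings, prune_below=2, max_len=8):
    """Count short substrings, then prune rare ones (toy stand-in for a PST)."""
    counts = defaultdict(int)
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                counts[s[i:j]] += 1
    return {sub: c for sub, c in counts.items() if c >= prune_below}

def estimate_selectivity(pst, query, n_strings):
    """Crude estimate: count of the longest retained prefix of the query."""
    for j in range(len(query), 0, -1):
        if query[:j] in pst:
            return min(1.0, pst[query[:j]] / n_strings)
    return 1.0 / n_strings   # unseen substring: assume near-zero selectivity
```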

4.
To address the inefficiency of existing distance-based outlier detection algorithms on large-scale data, a distributed outlier detection algorithm based on clustering and indexing (DODCI) is proposed. First, a clustering method partitions the large dataset into clusters; then an index is built for each cluster in parallel at the nodes of the distributed environment; finally, outlier detection is performed iteratively at each node using two optimization strategies and two pruning rules. Experimental results on synthetic datasets and a preprocessed KDD CUP dataset show that, on large data volumes, the algorithm is nearly an order of magnitude faster than the Orca and iDOoR algorithms. Theoretical and experimental analysis shows that the algorithm effectively improves the efficiency of outlier detection on large-scale data.
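A single-machine sketch of the cluster-then-detect idea (the algorithm above additionally builds per-cluster indexes on distributed nodes and applies two pruning rules, none of which is modelled here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def partitioned_outlier_scores(X, n_clusters=4, k=5):
    X = np.asarray(X, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    scores = np.zeros(len(X))
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(X[idx])
        dists, _ = nn.kneighbors(X[idx])
        scores[idx] = dists[:, -1]   # distance to k-th neighbour inside the cluster
    return scores                    # large score => outlier candidate
```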

5.
Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurement, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop and is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset's objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency and scalability of the proposed algorithms.
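A naive Monte Carlo rendering of the nested-loop baseline that the PC-list is designed to beat: each uncertain object is a Gaussian around its mean, and an object with few expected neighbours within R ranks as a top-k outlier. All parameter names here are assumptions:

```python
import numpy as np

def topk_uncertain_outliers(means, sigma=0.3, R=1.0, k=3, samples=200, seed=0):
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    n, dim = means.shape
    draws = means[:, None, :] + sigma * rng.standard_normal((n, samples, dim))
    expected_nbrs = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                d = np.linalg.norm(draws[i] - draws[j], axis=1)  # paired samples
                expected_nbrs[i] += (d <= R).mean()   # estimate of P(dist <= R)
    return np.argsort(expected_nbrs)[:k]   # fewest expected neighbours first
```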

6.
Research on distance-based outlier detection   Total citations: 15 (self: 0, others: 15)
Outlier detection is an important knowledge discovery task. Building on an analysis of distance-based outliers and their detection algorithms, this paper proposes a new definition for judging outliers and designs a sampling-based approximate detection algorithm, which is evaluated experimentally on real data. The results show that the new definition not only yields the same outliers as the DB(p,d) definition, but also simplifies what the user must specify, while additionally quantifying how outlying each object is within the dataset.
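A direct sketch of the DB(p, d) test mentioned above, with optional sampling in the spirit of the paper's approximate detector (the paper's own definition differs in its details):

```python
import numpy as np

def is_db_outlier(x, X, p=0.95, d=2.0, sample_size=None, seed=0):
    """x is a DB(p, d) outlier if at least fraction p of X lies farther than d."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if sample_size is not None and sample_size < len(X):
        X = X[rng.choice(len(X), size=sample_size, replace=False)]
    far = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1) > d
    return far.mean() >= p
```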

7.
Finding rare instances, or outliers, is important in many KDD (knowledge discovery and data mining) applications, such as detecting credit card fraud or finding irregularities in gene expressions. Signal-processing techniques have been introduced to transform images for enhancement, filtering, restoration, analysis, and reconstruction. In this paper, we present a new method in which we apply signal-processing techniques to solve important problems in data mining. In particular, we introduce a novel deviation (or outlier) detection approach, termed FindOut, based on the wavelet transform. The main idea of FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research suggested that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify outliers in large datasets. Experimental results on very large datasets are presented which show the efficiency and effectiveness of the proposed approach. Received 7 September 2000 / Revised 2 February 2001 / Accepted in revised form 31 May 2001. Correspondence and offprint requests to: A. Zhang, Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA. Email: azhang@cse.buffalo.edu
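A loose 1-D illustration of the FindOut idea only (the paper operates on multi-dimensional grids with full wavelet transforms): bin the data, smooth the bin counts at cluster scale with one Haar averaging step, and flag points falling where the smoothed density is negligible. The bin count (kept even for pairing) and threshold are arbitrary choices:

```python
import numpy as np

def findout_1d(x, bins=64, threshold=0.5):
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=bins)    # bins must be even here
    c = counts.astype(float)
    # one Haar averaging step: pairwise means capture cluster-scale density
    approx = np.repeat((c[0::2] + c[1::2]) / 2.0, 2)
    which = np.digitize(x, edges[1:-1])           # bin index of each point
    return np.where(approx[which] < threshold)[0]  # points outside any cluster
```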

8.
Anomaly detection is considered an important data mining task, aiming at the discovery of elements (known as outliers) that show significant diversion from the expected case. More specifically, given a set of objects, the problem is to return the suspicious objects that deviate significantly from typical behavior. As in the case of clustering, applying different criteria leads to different definitions of an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if there are fewer than k objects lying at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on the fly. A few research works study the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications, for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited, and (iii) they lack flexibility, in the sense that they assume a single configuration of the k and R parameters. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques reduce the required storage overhead, are more efficient than previously proposed techniques, and offer significant flexibility with regard to the input parameters. Experiments performed on real-life and synthetic data sets verify our theoretical study.
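A brute-force sliding-window check of exactly the definition quoted above — useful as a reference implementation, whereas the paper's contribution is doing this with far less storage and computation:

```python
from collections import deque

def current_outliers(window, k=3, R=1.0):
    """Flag x if fewer than k other window objects lie within distance R."""
    flagged = []
    for i, x in enumerate(window):
        neighbours = sum(1 for j, y in enumerate(window)
                         if j != i and abs(x - y) <= R)
        if neighbours < k:
            flagged.append(x)
    return flagged

# slide a fixed-size window over a toy 1-D stream, re-checking at each arrival
win = deque(maxlen=6)
for v in [1.0, 1.1, 0.9, 5.0, 1.2, 0.8, 1.05]:
    win.append(v)
    print(f"after {v}: outliers = {current_outliers(win, k=2, R=0.5)}")
```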

9.
Outlier detection algorithms are widely applied in fields such as network intrusion detection and computer-aided medical diagnosis. To address the long execution times and low detection rates of the LDOF, CBOF, and LOF algorithms on large-scale and high-dimensional datasets, an outlier detection algorithm based on random walk on a graph (BGRW) is proposed. First, the number of iterations, the damping factor, and the outlier score of every object in the dataset are initialized; next, the walker's transition probabilities between objects are derived from their pairwise Euclidean distances; then the outlier score of every object is obtained by iterative computation; finally, the objects with the highest outlier scores are judged to be outliers and output. Experiments on real UCI datasets and on synthetic datasets with complex distributions compare BGRW with LDOF, CBOF, and LOF in terms of execution time, detection rate, and false-alarm rate. The results show that BGRW effectively reduces execution time and outperforms the compared algorithms on both detection rate and false-alarm rate.
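A sketch of random-walk outlier scoring along the lines of this abstract (the similarity kernel, damping value, and iteration count are placeholders, not the paper's):

```python
import numpy as np

def bgrw_scores(X, damping=0.85, iters=50):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-D)                          # similarity from Euclidean distance
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    v = np.full(n, 1.0 / n)                 # initial visit distribution
    for _ in range(iters):
        v = (1 - damping) / n + damping * (v @ P)
    return 1.0 / v                          # rarely visited => high outlier score
```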

10.
Summary. In a shared-memory distributed system, n independent asynchronous processes communicate by reading and writing to shared variables. An algorithm is adaptive (to total contention) if its step complexity depends only on the actual number, k, of active processes in the execution; this number is unknown in advance and may change in different executions of the algorithm. Adaptive algorithms are inherently wait-free, providing fault tolerance in the presence of an arbitrary number of crash failures and varying process speeds. A wait-free adaptive collect algorithm with O(k) step complexity is presented, together with its applications in wait-free adaptive algorithms for atomic snapshots, immediate snapshots and renaming. Received: August 1999 / Accepted: August 2001

11.
Traditional outlier detection algorithms struggle to learn the distribution pattern of outliers in extremely class-imbalanced, high-dimensional datasets, leading to low detection rates. To address this, a GAN-VAE algorithm combining a generative adversarial network (GAN) with a variational auto-encoder (VAE) is proposed. First, the outliers are fed into the VAE for training so that it learns their distribution pattern; then the VAE is trained jointly with the GAN to generate additional potential outliers while learning the classification boundary between normal points and outliers; finally, test data are fed into the trained GAN-VAE, an outlier score is computed for each object from the difference in relative density between normal points and outliers, and objects with high scores are judged to be outliers. Comparative experiments against six outlier detection algorithms on four real datasets show that GAN-VAE improves AUC, accuracy, and F1-score by an average of 5.64%, 5.99%, and 13.30% respectively, demonstrating that the algorithm is effective and practical.

12.
Outlier detection in high-dimensional spaces   Total citations: 35 (self: 2, others: 33)
In many KDD (knowledge discovery in databases) applications, such as fraud monitoring in electronic commerce, finding exceptional cases or outliers is more interesting than finding regular knowledge. Most existing outlier detection methods handle only numerical attributes, and they merely detect outliers without explaining what makes them outlying. This paper proposes a hypergraph-based definition of outliers that both captures the notion of locality and explains well why an object is an outlier. It also presents the HOT (hypergraph-based outlier test) algorithm, which detects outliers by computing the support, membership degree, and size deviation of each object. The algorithm handles both numerical and categorical attributes. Analysis shows that it can effectively find outliers in high-dimensional data.

13.
Summary. This paper presents adaptive algorithms for mutual exclusion using only read and write operations; the performance of the algorithms depends only on the point contention, i.e., the number of processes that are concurrently active during the algorithm's execution (and not on n, the total number of processes). Our algorithm has O(k) remote step complexity and O(log k) system response time, where k is the point contention. The remote step complexity is the maximal number of steps performed by a process, where a wait is counted as one step. The system response time is the time interval between subsequent entries to the critical section, where one time unit is the minimal interval in which every active process performs at least one step. The space complexity of this algorithm is O(N log n), where N is the range of processes' names. We show how to make the space complexity of our algorithm depend solely on n, while preserving the other performance measures of the algorithm. Received: March 2001 / Accepted: November 2001

14.
Geometric groundtruth at the character, word, and line levels is crucial for designing and evaluating optical character recognition (OCR) algorithms. Kanungo and Haralick proposed a closed-loop methodology for generating geometric groundtruth for rescanned document images. The procedure assumed that the original image and the corresponding groundtruth were available. It automatically registered the original image to the rescanned one using four corner points and then transformed the original groundtruth using the estimated registration transformation. In this paper, we present an attributed branch-and-bound algorithm for establishing the point correspondence that uses all the data points. We group the original feature points into blobs and use corners of blobs for matching. The Euclidean distance between character centroids is used as the error metric. We conducted experiments on synthetic point sets with varying layout complexity to characterize the performance of two matching algorithms. We also report results on experiments conducted using the University of Washington dataset. Finally, we show examples of application of this methodology for generating groundtruth for microfilmed and FAXed versions of the University of Washington dataset documents. Received: July 24, 2001 / Accepted: May 20, 2002

15.
Zhao Feng, Qin Feng. Computer Engineering, 2009, 35(19): 78-80
This paper studies the cell-based outlier detection algorithm and gives the procedure for partitioning the data space into cells and assigning data objects to them. To remedy a deficiency in how the algorithm sets the threshold M, the algorithm is improved and applied to the analysis of tax-payment behaviour. Comparison with other outlier detection algorithms shows that the improved algorithm not only mines outliers in tax-payment behaviour effectively but also locates where the outliers are, which aids the analysis of taxpayer behaviour.
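A simplified pass of the cell-based pruning this paper builds on (Knorr and Ng's grid, 2-D case): cells whose own plus layer-1 neighbour counts exceed M cannot contain outliers and are discarded before any pairwise checks. A sketch under those assumptions:

```python
import math
from collections import defaultdict

def candidate_cells(points, d, M):
    side = d / (2 * math.sqrt(2))             # 2-D cell width: same-cell pairs <= d/2
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // side), int(y // side))].append((x, y))
    survivors = {}
    for (cx, cy), pts in cells.items():
        count = sum(len(cells.get((cx + dx, cy + dy), []))
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        if count <= M:                        # dense neighbourhoods cannot hold outliers
            survivors[(cx, cy)] = pts
    return survivors                          # only these points need exact checks
```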

16.
On Detecting Spatial Outliers   Total citations: 1 (self: 1, others: 0)
The ever-increasing volume of spatial data has greatly challenged our ability to extract useful but implicit knowledge from them. As an important branch of spatial data mining, spatial outlier detection aims to discover the objects whose non-spatial attribute values are significantly different from the values of their spatial neighbors. These objects, called spatial outliers, may reveal important phenomena in a number of applications including traffic control, satellite image analysis, weather forecast, and medical diagnosis. Most of the existing spatial outlier detection algorithms mainly focus on identifying single attribute outliers and could potentially misclassify normal objects as outliers when their neighborhoods contain real spatial outliers with very large or small attribute values. In addition, many spatial applications contain multiple non-spatial attributes which should be processed altogether to identify outliers. To address these two issues, we formulate the spatial outlier detection problem in a general way, design two robust detection algorithms, one for single attribute and the other for multiple attributes, and analyze their computational complexities. Experiments were conducted on a real-world data set, West Nile virus data, to validate the effectiveness of the proposed algorithms.
Feng Chen (Corresponding author)
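A minimal single-attribute version of the neighbourhood comparison described above; note that this is the naive detector, without the robustness to contaminated neighbourhoods that the paper's algorithms add:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def spatial_outliers(coords, attr, k=5, z_thresh=2.0):
    coords, attr = np.asarray(coords, float), np.asarray(attr, float)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nbrs.kneighbors(coords)               # column 0 is the point itself
    diffs = attr - attr[idx[:, 1:]].mean(axis=1)   # value minus neighbour average
    z = (diffs - diffs.mean()) / diffs.std()
    return np.where(np.abs(z) > z_thresh)[0]       # indices flagged as spatial outliers
```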

17.
Distance measures and outlier detection in rough sets   Total citations: 1 (self: 0, others: 1)
Traditional distance-based outlier detection methods cannot effectively handle datasets with categorical attributes. To address this, distance-based outlier detection is brought into rough set theory, using rough sets to handle categorical attributes. First, three distance measures for categorical attributes are proposed within the rough-set framework; then a corresponding outlier detection algorithm is designed for each measure to detect outliers in datasets containing categorical attributes; finally, experiments on two UCI datasets with categorical attributes verify the feasibility and effectiveness of these algorithms.
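One plausible categorical distance of the kind this paper defines (a simple attribute-overlap measure rather than the paper's rough-set formulations), plugged into a DB-style outlier test:

```python
def overlap_distance(a, b):
    """Fraction of attributes on which two categorical tuples disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def categorical_outliers(rows, p=0.8, d=0.5):
    """DB(p, d)-style test using the overlap distance."""
    flagged = []
    for i, r in enumerate(rows):
        far = sum(overlap_distance(r, s) > d
                  for j, s in enumerate(rows) if j != i)
        if far / (len(rows) - 1) >= p:
            flagged.append(i)
    return flagged
```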

18.
Association Rule Mining algorithms operate on a data matrix (e.g., customers × products) to derive association rules [AIS93b, SA96]. We propose a new paradigm, namely, Ratio Rules, which are quantifiable in that we can measure the "goodness" of a set of discovered rules. We also propose the "guessing error" as a measure of this "goodness": the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. Another contribution is a novel method to guess missing/hidden values from the Ratio Rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can "guess" the amount spent on butter. Thus, unlike association rules, Ratio Rules can perform a variety of important tasks such as forecasting, answering "what-if" scenarios, detecting outliers, and visualizing the data. Moreover, we show that we can compute Ratio Rules in a single pass over the data set with small memory requirements (a few small matrices), in contrast to association rule mining methods, which require multiple passes and/or large memory. Experiments on several real data sets (e.g., basketball and baseball statistics, biological data) demonstrate that the proposed method: (a) leads to rules that make sense; (b) can find large itemsets in binary matrices, even in the presence of noise; and (c) consistently achieves a "guessing error" up to 5 times smaller than using straightforward column averages. Received: March 15, 1999 / Accepted: November 1, 1999
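A sketch of the Ratio Rules mechanism via a low-rank (SVD) view of the data matrix — consistent with the abstract's milk/bread/butter example, though the toy numbers below are made up:

```python
import numpy as np

def ratio_rule_guess(M, partial, known_cols, rank=1):
    """Guess a full row from its known cells via rank-`rank` structure of M."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    V = Vt[:rank].T                       # columns: "ratio rule" directions
    coef, *_ = np.linalg.lstsq(V[known_cols], partial, rcond=None)
    return V @ coef                       # reconstructed row, hidden cells included

# toy matrix: rows = customers, columns = [milk, bread, butter] dollars;
# knowing $10 milk and $3 bread, the rank-1 rule guesses the butter amount
M = [[10, 3, 2], [20, 6, 4], [5, 1.5, 1]]
print(ratio_rule_guess(M, np.array([10.0, 3.0]), [0, 1]))   # ~ [10, 3, 2]
```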

19.
Some significant progress related to multidimensional data analysis has been achieved in the past few years, including the design of fast algorithms for computing datacubes, selecting some precomputed group-bys to materialize, and designing efficient storage structures for multidimensional data. However, little work has been carried out on multidimensional query optimization issues. In particular, the response time (or evaluation cost) for answering several related dimensional queries simultaneously is crucial to OLAP applications. Recently, Zhao et al. first explored this problem by presenting three heuristic algorithms. In this paper we first consider in detail two cases of the problem in which all the queries are either hash-based star joins or index-based star joins only. In the case of the hash-based star join, we devise a polynomial approximation algorithm which delivers a plan whose evaluation cost is $O(n^{\epsilon})$ times the optimal, where $n$ is the number of queries and $\epsilon$ is a fixed constant with $0 < \epsilon \le 1$. We also present an exponential algorithm which delivers a plan with the optimal evaluation cost. In the case of the index-based star join, we present a heuristic algorithm which delivers a plan whose evaluation cost is $n$ times the optimal, and an exponential algorithm which delivers a plan with the optimal evaluation cost. We then consider a general case in which both hash-based star-join and index-based star-join queries are included. For this case, we give a possible improvement on the work of Zhao et al., based on an analysis of their solutions. We also develop another heuristic and an exact algorithm for the problem. We finally conduct a performance study by implementing our algorithms. The experimental results demonstrate that the solutions delivered for the restricted cases are always within two times of the optimal, which confirms our theoretical upper bounds. In fact, these experiments produce much better results than our theoretical estimates. To the best of our knowledge, this is the only development of polynomial algorithms for the first two cases that delivers plans with deterministic performance guarantees in terms of the quality of the plans generated. Previous approaches, including that of [ZDNS98], may generate a feasible plan for the problem in these two cases, but they do not provide any performance guarantee, i.e., the plans generated by their algorithms can be arbitrarily far from the optimal one. Received: July 21, 1998 / Accepted: August 26, 1999

20.
Outliers are objects whose attributes differ from those of normal points; detecting them has important applications across industries, such as maintaining data purity and safeguarding operations, and most existing algorithms judge outliers using traditional distance- or density-based criteria. The algorithm proposed here assigns each object an "isolation degree", i.e., how isolated the point is relative to its neighbours, and identifies outliers by sorting on it, which is more efficient than traditional algorithms. Building on and improving the affinity propagation (AP) clustering algorithm, we propose APO (outlier detection algorithm based on affinity propagation), which can detect anomalous data points. By adding an isolation-degree module that computes each sample point's isolation information, and by introducing an amplification factor, the difference between outliers and normal points becomes more pronounced, increasing the algorithm's sensitivity to outliers and thereby its accuracy. Comparative experiments on synthetic and real datasets show that the algorithm is more sensitive to outliers than AP and, unlike other detection algorithms, can cluster while detecting outliers.
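A rough sketch of an "isolation degree" with an amplification factor, as the abstract describes (computed here from k-NN distances rather than inside AP's message passing; `k` and `amplify` are placeholders):

```python
import numpy as np

def isolation_degrees(X, k=5, amplify=2.0):
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn_avg = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self-distance
    degree = knn_avg / np.median(knn_avg)   # isolation relative to a typical point
    return degree ** amplify                # amplification widens the outlier gap
```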
