1.
To address the difficulty of clustering large-scale data sets, this paper analyzes the advantages of sampling techniques, examines the characteristics of random sampling in data mining, and on this basis proposes a density-biased sampling method. A sample obtained by density-biased sampling reflects the characteristics of the overall data set more accurately and allows flexible control of the sampling rate in different regions of the data set. Experiments show that, when clustering large-scale data sets, density-biased sampling outperforms random sampling in time complexity.
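The abstract does not spell out the sampling scheme, but a common way to realize density-biased sampling is a two-pass grid method: estimate local density from cell counts, then keep each point with a density-dependent probability. The sketch below assumes 2-D points and a caller-supplied rate function; all names are illustrative, not the paper's.

```python
import random
from collections import defaultdict

def density_biased_sample(points, cell_size, rate_for_density):
    """Two-pass grid sketch of density-biased sampling.

    points: list of (x, y) tuples.
    cell_size: side length of the grid cells used to estimate density.
    rate_for_density: maps a cell's point count to a keep-probability,
        which is how the sampling rate varies across regions.
    """
    # Pass 1: estimate local density by counting points per grid cell.
    counts = defaultdict(int)
    cells = []
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
        cells.append(cell)
    # Pass 2: keep each point with a density-dependent probability.
    sample = []
    for p, cell in zip(points, cells):
        if random.random() < rate_for_density(counts[cell]):
            sample.append(p)
    return sample
```

With, say, `rate_for_density=lambda n: min(1.0, 20.0 / n)`, dense cells are down-sampled while sparse cells are kept almost entirely, matching the abstract's claim of region-dependent sampling rates.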
2.
Persistent homology is a computationally intensive and yet extremely powerful tool for Topological Data Analysis. Applying the tool to a potentially infinite sequence of data objects is a challenging task. For this reason, persistent homology and data stream mining have long been two important but disjoint areas of data science. The first computational model recently introduced to bridge the gap between the two areas is useful for detecting steady or gradual changes in data streams, such as certain genomic modifications during the evolution of species. However, that model is not suitable for applications that encounter abrupt changes of extremely short duration. This paper presents another model for computing persistent homology on streaming data that addresses the shortcoming of the previous work. The model is validated on the important real-world application of network anomaly detection. It is shown that, in addition to detecting the occurrence of anomalies or attacks in computer networks, the proposed model is able to visually identify several types of traffic. Moreover, the model can accurately detect abrupt changes of both extremely short and longer duration in the network traffic. These capabilities are not achievable by the previous model or by traditional data mining techniques.
3.
For most data stream applications, the volume of data is too huge to be stored on permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient: a good approximation obtained in a timely manner is often better than an exact answer delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, algorithms must be designed to meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data while guaranteeing that the support error of frequent patterns stays strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.
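The paper's own algorithm is not described in the abstract; a classic method offering the same strict ε-bounded support error, Manku and Motwani's Lossy Counting (for frequent items rather than full itemsets), conveys the flavor. A minimal sketch, not the authors' algorithm:

```python
import math

class LossyCounting:
    """Lossy Counting sketch: every item with true frequency >= s*n is
    reported, and no count is underestimated by more than epsilon*n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)  # bucket width
        self.n = 0                             # stream length so far
        self.entries = {}                      # item -> [count, max_error]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new entry may have been missed for at most bucket-1 items.
            self.entries[item] = [1, bucket - 1]
        # Prune at bucket boundaries: drop entries that cannot be frequent.
        if self.n % self.width == 0:
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, s):
        # Report items whose estimated count clears (s - epsilon) * n.
        return [k for k, (c, _) in self.entries.items()
                if c >= (s - self.epsilon) * self.n]
```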
4.
By improving a weighted sampling algorithm and combining it with the basic-window technique, two random sampling algorithms, WRSB and IWRSB, are proposed for continuously updated sliding windows over weighted data streams. When a new tuple arrives, a key is computed from the tuple's weight, and the key's magnitude determines whether the tuple enters the sample set and which sample tuple it replaces. A system buffer holds recently arrived tuples with larger keys as backups for expired tuples, so the algorithms can handle tuple expiry effectively. Theoretical analysis and experimental results show that both algorithms effectively handle random sampling over continuously updated sliding windows on weighted data streams, with IWRSB offering the better performance.
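The key computation from tuple weights is in the spirit of Efraimidis–Spirakis weighted reservoir sampling (A-Res), where an item of weight w gets key u^(1/w) for uniform u and the k largest keys are retained; the correspondence to WRSB/IWRSB is an assumption, and the sliding-window expiry and backup buffer described above are omitted from this minimal sketch.

```python
import heapq
import random

def weighted_reservoir(stream, k):
    """Efraimidis-Spirakis A-Res: keep the k items with the largest keys
    u**(1/w), so an item's inclusion probability grows with its weight.
    `stream` yields (item, weight) pairs with weight > 0."""
    heap = []  # min-heap of (key, seq, item); root holds the smallest key
    for seq, (item, weight) in enumerate(stream):
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, seq, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, seq, item))
    return [item for _, _, item in heap]
```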
5.
Research on Sampling Methods for Data Mining
Sampling is a general and effective approximation technique. In data mining research, sampling can significantly reduce the size of the data set being processed, enabling many data mining algorithms to be applied to large-scale data sets and to data streams. Through a comparative study and analysis of representative sampling methods used in data mining, this paper proposes a classification framework for sampling algorithms. After pointing out the limitations of uniform sampling, it explains why biased sampling is necessary in certain application scenarios, surveys the research and application of sampling techniques in data mining, and finally discusses the challenges and future directions of sampling methods for data stream mining.
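As a baseline for the survey's contrast between uniform and biased sampling, the standard one-pass uniform method over streams is Vitter's Algorithm R, sketched here for reference (not taken from the survey itself):

```python
import random

def reservoir_sample(stream, k):
    """Vitter's Algorithm R: one pass, O(k) memory, and every element
    of a length-n stream ends up in the sample with probability k/n."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = random.randint(0, i)  # uniform in [0, i]
            if j < k:
                sample[j] = x         # replace a random slot
    return sample
```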
6.
Atish Das Sarma, Richard J. Lipton, Danupon Nanongkai. Theoretical Computer Science, 2011, 412(23): 2544–2555
We study a new model of computation, called best-order stream, for graph problems. Roughly, it is a proof system where a space-limited verifier has to verify a proof sequentially (i.e., it reads the proof as a stream). Moreover, the proof itself is just a specific ordering of the input data. This model is closely related to many models of computation in other areas such as data streams, communication complexity, and proof checking, and could be used in applications such as cloud computing. In this paper we focus on graph problems where the input is a sequence of edges. We show that, even under this model, checking some basic graph properties deterministically requires linear space in the number of nodes. We also show that, in contrast, randomized verifiers are powerful enough to check many graph properties in polylogarithmic space.
7.
Expert Systems with Applications, 2014, 41(2): 694–708
As data have accumulated more quickly in recent years, the corresponding databases have also become huge, and general frequent pattern mining methods thus face limitations in responding appropriately to massive data. To overcome this problem, data mining researchers have studied methods which can conduct more efficient and immediate mining tasks by scanning databases only once. Thereafter, the sliding window model, which can focus mining operations on recently accumulated parts of data streams, was proposed, and a variety of related mining approaches have been suggested. However, it is hard to mine all of the frequent patterns in the data stream environment, since the number of generated patterns increases remarkably as data streams are continuously extended. Thus, methods for efficiently compressing generated patterns are needed to solve that problem. In addition, since weight constraints expressing items' importance are as important a factor in pattern mining as support conditions, we need to consider them in the mining process. Motivated by these issues, we propose a novel algorithm, weighted maximal frequent pattern mining over data streams based on the sliding window model (WMFP-SW), to obtain weighted maximal frequent patterns reflecting recent information over data streams. Performance experiments report that WMFP-SW outperforms previous algorithms in terms of runtime, memory usage, and scalability.
8.
Improving industrial product reliability, maintainability and thus availability is a challenging task for many industrial companies. In industry, there is a growing need to process data in real time, since the generated data volume exceeds the available storage capacity. This paper reviews data stream mining and data stream management systems aimed at improving product availability. Further, a newly developed and validated grid-based classifier method is presented and compared to a one-class support vector machine (OCSVM) and a polygon-based classifier.
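The abstract gives no details of the grid-based method; one minimal form of a grid-based one-class classifier records the grid cells occupied by normal training data and flags points that land in unseen cells. The sketch below is an illustration under that assumption, not the authors' validated method:

```python
def fit_grid(normal_points, cell_size):
    """Record which grid cells the normal (training) data occupies.
    Points are tuples of floats of any fixed dimension."""
    return {tuple(int(v // cell_size) for v in p) for p in normal_points}

def is_anomaly(point, occupied, cell_size):
    """Flag a point if it falls in a cell never seen during training."""
    return tuple(int(v // cell_size) for v in point) not in occupied

# Usage: occupied = fit_grid(train_points, 0.5)
#        is_anomaly((9.3, 1.2), occupied, 0.5)  -> True/False
```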
9.
Hillol Kargupta, Souptik Datta, Qi Wang, Krishnamoorthy Sivakumar. Knowledge and Information Systems, 2005, 7(4): 387–414
Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain, and then develops a random-matrix-based spectral-filtering technique to retrieve original data from a dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques, for example algorithms that explicitly guard against privacy breaches through linear transformations, or that exploit multiplicative and colored noise for preserving privacy in data-mining applications.
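A minimal numpy sketch of the spectral-filtering idea, assuming additive i.i.d. noise of known variance: the Marchenko–Pastur law bounds the covariance eigenvalues attributable to pure noise, so eigencomponents above the bound are kept as signal. The exact estimator in the paper may differ:

```python
import numpy as np

def spectral_filter(perturbed, noise_var):
    """Estimate the original data from `perturbed` (n samples x m
    features, data plus additive i.i.d. noise of variance `noise_var`)
    by keeping only eigencomponents above the Marchenko-Pastur edge."""
    n, m = perturbed.shape
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean
    cov = centered.T @ centered / n
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Largest eigenvalue attributable to pure noise.
    lam_max = noise_var * (1 + np.sqrt(m / n)) ** 2
    signal = eigvecs[:, eigvals > lam_max]  # signal-subspace basis
    # Project onto the signal subspace to strip most of the noise.
    return centered @ signal @ signal.T + mean
```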
10.
Random Sampling of Euler Tours
We define a Markov chain on the set of Euler tours of a given Eulerian graph, based on transformations first defined by Kotzig in 1966. We prove that the chain is rapidly mixing if the maximum degree in the given graph is 6, thus obtaining an efficient algorithm for sampling and counting the set of Euler tours for such an Eulerian graph.
11.
A data stream is a massive and unbounded sequence of data elements that are continuously generated at a fast speed. Compared with traditional approaches, data mining in data streams is more challenging, since several extra requirements need to be satisfied. In this paper, we propose a mining algorithm for finding frequent itemsets over a transactional data stream. Unlike most existing algorithms, our method works based on the theory of Approximate Inclusion–Exclusion. Without incrementally maintaining an overall synopsis of the stream, we can approximate the itemsets' counts from the maintained information and a count-bounding technique. Some additional techniques are designed and integrated into the algorithm for performance improvement. Finally, the performance of the proposed algorithm is tested and analyzed through a series of experiments.
12.
A false negative approach to mining frequent itemsets from high speed transactional data streams
Mining frequent itemsets from transactional data streams is challenging due to the exponential explosion of itemsets and the limited memory available for mining. Given a domain of I unique items, the possible number of itemsets can be up to 2^I − 1. When the length of the data stream approaches a very large number N, the possibility of an itemset being frequent becomes larger and more difficult to track with limited memory. The existing studies on finding frequent items from high speed data streams are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s − ε to be counted as frequent. However, such false-positive oriented approaches cannot be effectively applied to frequent itemset mining for two reasons. First, false-positive items found increase the number of false-positive frequent itemsets exponentially. Second, minimizing the number of false-positive items found by using a small ε makes memory consumption large. Therefore, such approaches may make the problem computationally intractable with bounded memory consumption. In this paper, we developed algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with bounded memory consumption. Our algorithms are based on the Chernoff bound: we use a running error parameter to prune item(set)s and a reliability parameter to control memory. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.
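One common Chernoff-derived form of the running error, used in false-negative mining algorithms of this family (the paper's exact formula may differ), shrinks as the stream grows, so pruning tightens over time:

```python
import math

def running_error(s, delta, n):
    """Chernoff-style running error after n observations. `s` is the
    minimum support and `delta` the reliability parameter bounding the
    probability of losing a truly frequent item(set)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def prune(counts, s, delta, n):
    """Drop item(set)s whose observed support falls below s - eps_n;
    with probability >= 1 - delta no truly frequent one is lost."""
    eps = running_error(s, delta, n)
    return {x: c for x, c in counts.items() if c / n >= s - eps}
```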
13.
With the development of computer network technology and network applications, performance testing of network devices has become a hot research topic. Extracting and storing results during long-duration performance tests is a key point and difficulty of network performance testing. This paper analyzes network performance testing methods and the key problems in high-speed performance testing, and proposes a network performance testing model. A flow-based packet sampling method effectively solves the result extraction and storage problem in long-duration tests. Finally, experiments confirm the feasibility of the model architecture and the effectiveness of the sampling technique.
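A standard way to implement flow-based packet sampling is to hash each packet's 5-tuple and keep the flow whenever its hash falls below the sampling rate, so every packet of a chosen flow is retained. A sketch under that assumption (field names are illustrative):

```python
import hashlib

def flow_sampled(pkt, rate):
    """Hash-based flow sampling: all packets of a flow share a 5-tuple,
    hence a hash value, so a flow is either fully kept or fully dropped.
    `pkt` is a dict with the 5-tuple fields; `rate` is in (0, 1]."""
    key = "{src}:{sport}>{dst}:{dport}/{proto}".format(**pkt)
    digest = hashlib.sha1(key.encode()).digest()
    h = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return h < rate
```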
14.
The utility of an itemset represents its value, and utility mining aims at identifying itemsets with high utility. Temporal high utility itemsets are the itemsets whose support is larger than a pre-specified threshold in the current time window of the data stream. Discovery of temporal high utility itemsets is an important process for mining interesting patterns like association rules from data streams. In this paper, we propose a novel method, THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from data streams efficiently and effectively. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams. The novel contribution of THUI-Mine is that it can effectively identify the temporal high utility itemsets by generating fewer candidate itemsets, so that the execution time of mining all high utility itemsets in data streams can be reduced substantially. In this way, all temporal high utility itemsets under all time windows of data streams can be discovered effectively with less memory space and execution time. This meets the critical requirements on time and space efficiency for mining data streams. Through experimental evaluation, THUI-Mine is shown to significantly outperform other existing methods, such as the Two-Phase algorithm, under various experimental conditions.
15.
This paper proposes SW-COUNT, an algorithm for mining frequent items over data streams. It uses data sampling to find frequent items within a sliding window. Given an error bound ε, SW-COUNT detects frequent items with error at most εn using O(1/ε) space, with O(1) average processing time per item. Extensive experiments show that the algorithm achieves better accuracy and better time and space efficiency than comparable algorithms.
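SW-COUNT itself is not specified in the abstract beyond its bounds; the Misra–Gries summary achieves the same O(1/ε) space and εn error trade-off for frequent items and conveys the idea (SW-COUNT additionally uses sampling and sliding-window expiry, omitted here):

```python
import math

def misra_gries(stream, epsilon):
    """Misra-Gries summary with ceil(1/epsilon) counters: each item's
    count is underestimated by at most epsilon * n over a stream of
    length n, matching the error/space trade-off cited above."""
    k = math.ceil(1.0 / epsilon)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```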
16.
A novel hash-based approach for mining frequent itemsets over data streams requiring less memory space
In recent times, data are generated as a form of continuous data streams in many applications. Since handling data streams is necessary and discovering knowledge behind data streams can often yield substantial benefits, mining over data streams has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches often consist of two procedures: continuously maintaining synopses for data streams and finding frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be kept in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may cause memory utilization to be significantly degraded. In this paper, we consider compressing the information of all the itemsets into a structure with a fixed size using a hash-based technique. This hash-based approach skillfully summarizes the information of the whole data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets, and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented, and a series of experiments is performed to show the effectiveness and the efficiency of this approach.
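A minimal sketch of the fixed-size hash-synopsis idea, under the assumption that collisions only ever inflate a bucket's count, making each bucket an upper bound on the supports hashed into it; the paper's actual update and estimation rules are richer:

```python
import hashlib

class HashSynopsis:
    """Fixed-size hash table summarizing itemset counts. Because
    colliding itemsets add to the same bucket, `estimate` returns an
    upper bound on an itemset's true support."""

    def __init__(self, size):
        self.table = [0] * size

    def _bucket(self, itemset):
        # Canonical key: items stringified and sorted for order-independence.
        key = ",".join(sorted(map(str, itemset))).encode()
        return int(hashlib.md5(key).hexdigest(), 16) % len(self.table)

    def add(self, itemset):
        self.table[self._bucket(itemset)] += 1

    def estimate(self, itemset):
        return self.table[self._bucket(itemset)]
```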
17.
To address the problem of efficiently replacing cached data within nodes in Named Data Networking, this paper estimates the current popularity of each cached data chunk from its request frequency and request inter-arrival times, and proposes Po-Rep, a popularity-based replacement policy. When data returned from a hitting node is to be stored at a node, the least popular data in that node is evicted and replaced. The policy keeps the most valuable content in each node to serve subsequent user requests. Simulation results show that the policy effectively improves the in-network cache hit ratio, reduces server load, and improves overall network performance.
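A sketch of a Po-Rep style policy under an assumed scoring rule: popularity grows with request count and decays with the time since the last request, and the least popular entry is evicted when inserting into a full store. The formula is illustrative, not the paper's:

```python
import time

class PopularityCache:
    """Popularity-based replacement: evict the least popular entry."""

    def __init__(self, capacity, decay=0.5):
        self.capacity, self.decay = capacity, decay
        self.store = {}  # name -> (data, request_count, last_request_time)

    def _popularity(self, count, last_t, now):
        # Assumed score: frequency damped by time since the last request.
        return count / (1.0 + self.decay * (now - last_t))

    def request(self, name):
        if name in self.store:
            data, count, _ = self.store[name]
            self.store[name] = (data, count + 1, time.time())
            return data
        return None  # cache miss: forward the request upstream

    def insert(self, name, data):
        now = time.time()
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=lambda k: self._popularity(
                self.store[k][1], self.store[k][2], now))
            del self.store[victim]
        self.store[name] = (data, 1, now)
```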
18.
Ahmad Alzghoul, Björn Backe, Magnus Löfstrand, Arne Byström, Bengt Liljedahl. Computers in Industry, 2014
The field of fault detection and diagnosis has been the subject of considerable interest in industry. Fault detection may increase the availability of products, thereby improving their quality. Fault detection and diagnosis methods can be classified into three categories: data-driven, analytically based, and knowledge-based methods.
19.
Imbalanced data severely degrades the performance of traditional classification algorithms, lowering the recognition rate of the minority class. This paper proposes a hybrid sampling technique based on neighborhood characteristics: sampling weights are determined by the class distribution in each sample's neighborhood, and hybrid sampling then produces a balanced data set. A dynamic ensemble method based on local confidence is then applied: base classifiers are generated by classification learning, and for each test sample the best base classifiers are dynamically selected and combined according to local classification accuracy. Experiments on UCI benchmark data sets show that the method simultaneously improves the classification accuracy of both the minority and the majority classes on imbalanced data.
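The neighborhood weighting can be sketched in an ADASYN-like form, where a minority sample surrounded by more majority-class neighbors receives a larger sampling weight; the paper's exact rule is not given in the abstract:

```python
import numpy as np

def neighborhood_weights(X_min, X_maj, k=5):
    """ADASYN-style weights (an assumed stand-in for the paper's rule):
    minority samples with more majority-class neighbors are harder to
    learn and get proportionally more weight in the resampling step."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    weights = []
    for x in X_min:
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]        # skip the point itself
        weights.append(is_maj[nn].mean())  # majority fraction among kNN
    w = np.asarray(weights)
    if w.sum() == 0:                       # no hard samples: uniform
        return np.full(len(X_min), 1.0 / len(X_min))
    return w / w.sum()
```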