首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
周宇欢  蒋大伟  龚勇  陈聪 《计算机科学》2017,44(Z6):380-384
为了在不解密加密数据的前提下获取加密数据流的类型信息,提出一种基于数据随机性特征和模式识别的加密数据流识别方法。该方法利用加密数据与非加密数据,或者不同类型加密数据0,1分布的随机性特性作为分类特征,再利用模式识别方法对不同数据进行建模,从而实现对不同类型数据的自动识别。首先利用NIST随机性测试方法对数据流进行分析,将得到的15类随机性测试得分作为分类特征;然后对不同类型的数据流分别建立分类模型;最后利用训练好的数据模型对未知数据流进行识别。仿真实验显示,与仅用单个随机性特征进行明密数据识别相比,采用模式识别方法可以将错分率由原来的60%以上下降到30%左右;进一步利用滤波器方法对15类随机性特征进行优化降维,平均错分率进一步下降到15%左右。  相似文献   

2.
Recent research shows that rule based models perform well while classifying large data sets such as data streams with concept drifts. A genetic algorithm is a strong rule based classification algorithm which is used only for mining static small data sets. If the genetic algorithm can be made scalable and adaptable by reducing its I/O intensity, it will become an efficient and effective tool for mining large data sets like data streams. In this paper a scalable and adaptable online genetic algorithm is proposed to mine classification rules for the data streams with concept drifts. Since the data streams are generated continuously in a rapid rate, the proposed method does not use a fixed static data set for fitness calculation. Instead, it extracts a small snapshot of the training example from the current part of data stream whenever data is required for the fitness calculation. The proposed method also builds rules for all the classes separately in a parallel independent iterative manner. This makes the proposed method scalable to the data streams and also adaptable to the concept drifts that occur in the data stream in a fast and more natural way without storing the whole stream or a part of the stream in a compressed form as done by the other rule based algorithms. The results of the proposed method are comparable with the other standard methods which are used for mining the data streams.  相似文献   

3.
The CQL continuous query language: semantic foundations and query execution   总被引:2,自引:0,他引:2  
CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on “black-box” mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing. Edited by M. Franklin  相似文献   

4.
社交网络平台产生海量的短文本数据流,具有快速、海量、概念漂移、文本长度短小、类标签大量缺失等特点.为此,文中提出基于向量表示和标签传播的半监督短文本数据流分类算法,可对仅含少量有标记数据的数据集进行有效分类.同时,为了适应概念漂移,提出基于聚类簇的概念漂移检测算法.在实际短文本数据流上的实验表明,相比半监督分类算法和半监督数据流分类算法,文中算法不仅提高分类精度和宏平均,还能快速适应数据流中的概念漂移.  相似文献   

5.
动态数据流具有数据量大、变化快、随机存取代价高、详细数据难以存储等特点,挖掘动态数据流对计算能力与存储能力要求非常高。针对动态数据流的以上特点,设计了一种基于自助抽样的动态数据流贝叶斯分类算法,算法运用滑动窗口模型对动态数据流进行处理分析。该模型以每个窗口的数据为基本单位,对窗口内的数据进行处理分析;算法采用自助抽样技术对待分类数据中的属性进行裁剪和优化,解决了数据属性间的多重线性相关问题;算法结合贝叶斯算法的特点,采用动态增量存储树来解决动态样本数据流的存储问题,实现了无限动态数据流无信息失真的静态有限存储,解决了动态数据流挖掘最大的难题——数据存储;对优化的待分类数据使用all-贝叶斯分类器和k-贝叶斯分类器进行分类,结合数据流的特性对两个分类器进行实时更新。该算法有效克服了贝叶斯分类属性独立性的约束和传统贝叶斯只对静态数据分类的缺点,克服了动态数据流最大的难题——数据存储问题。通过实验测试证明,基于自助抽样的贝叶斯分类具有很高的时效性和精确性。  相似文献   

6.
Data stream learning has been largely studied for extracting knowledge structures from continuous and rapid data records. As data is evolving on a temporal basis, its underlying knowledge is subject to many challenges. Concept drift,1 as one of core challenge from the stream learning community, is described as changes of statistical properties of the data over time, causing most of machine learning models to be less accurate as changes over time are in unforeseen ways. This is particularly problematic as the evolution of data could derive to dramatic change in knowledge. We address this problem by studying the semantic representation of data streams in the Semantic Web, i.e., ontology streams. Such streams are ordered sequences of data annotated with ontological vocabulary. In particular we exploit three levels of knowledge encoded in ontology streams to deal with concept drifts: i) existence of novel knowledge gained from stream dynamics, ii) significance of knowledge change and evolution, and iii) (in)consistency of knowledge evolution. Such knowledge is encoded as knowledge graph embeddings through a combination of novel representations: entailment vectors, entailment weights, and a consistency vector. We illustrate our approach on classification tasks of supervised learning. Key contributions of the study include: (i) an effective knowledge graph embedding approach for stream ontologies, and (ii) a generic consistent prediction framework with integrated knowledge graph embeddings for dealing with concept drifts. The experiments have shown that our approach provides accurate predictions towards air quality in Beijing and bus delay in Dublin with real world ontology streams.  相似文献   

7.
This paper introduces a class of join algorithms, termed W-join, for joining multiple infinite data streams. W-join addresses the infinite nature of the data streams by joining stream data items that lie within a sliding window and that match a certain join condition. In addition to its general applicability in stream query processing, W-join can be used to track the motion of a moving object or detect the propagation of clouds of hazardous material or pollution spills over time in a sensor network environment. We describe two new algorithms for W-join and address variations and local/global optimizations related to specifying the nature of the window constraints to fulfill the posed queries. The performance of the proposed algorithms is studied experimentally in a prototype stream database system, using synthetic data streams and real time-series data. Tradeoffs of the proposed algorithms and their advantages and disadvantages are highlighted, given variations in the aggregate arrival rates of the input data streams and the desired response times per query. This is an extended version of the paper published in the Proceedings of the 15th International Conference on Scientific and Statistical Database Management, SSDBM 2003, Boston, U.S.A., pp. 75–84.  相似文献   

8.
It is challenging to use traditional data mining techniques to deal with real-time data stream classifications. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from UCI (University of California, Irvine) machine learning repository. The experimental results prove that our approach shows great flexibility and robustness in novel class detection in concept drifting and outperforms traditional classification models in challenging real-life data stream applications.  相似文献   

9.
大部分数据流分类算法解决了数据流无限长度和概念漂移这两个问题。但是,这些算法需要人工专家将全部实例都标记好作为训练集来训练分类器,这在数据流高速到达并需要快速分类的环境中是不现实的,因为标记实例需要时间和成本。此时,如果采用监督学习的方法来训练分类器,由于标记数据稀少将得到一个弱分类器。提出一种基于主动学习的数据流分类算法,该算法通过选择全部实例中的一小部分来人工标记,其中这小部分实例是分类置信度较低的样本,从而可以极大地减少需要人工标记的实例数量。实验结果表明,该算法可以在数据流存在概念漂移情况下,使用较少的标记数据对数据流训练出分类器,并且分类效果良好。  相似文献   

10.
数据流分类是数据挖掘领域的重要研究任务之一,已有的数据流分类算法大多是在有标记数据集上进行训练,而实际应用领域数据流中有标记的数据数量极少。为解决这一问题,可通过人工标注的方式获取标记数据,但人工标注昂贵且耗时。考虑到未标记数据的数量极大且隐含大量信息,因此在保证精度的前提下,为利用这些未标记数据的信息,本文提出了一种基于Tri-training的数据流集成分类算法。该算法采用滑动窗口机制将数据流分块,在前k块含有未标记数据和标记数据的数据集上使用Tri-training训练基分类器,通过迭代的加权投票方式不断更新分类器直到所有未标记数据都被打上标记,并利用k个Tri-training集成模型对第k+1块数据进行预测,丢弃分类错误率高的分类器并在当前数据块上重建新分类器从而更新当前模型。在10个UCI数据集上的实验结果表明:与经典算法相比,本文提出的算法在含80%未标记数据的数据流上的分类精度有显著提高。  相似文献   

11.
在介绍数据流及数据流系统的模型后,对降载时的系统约束、输出质量目标进行了正式阐述。提出数据流系统降载策略的分类方法 ,着重分析了目前一些较为重要的数据流系统降载策略,指出其特征和应用范围 ,最后总结了好的数据流降载策略应具有的特点以及未来研究的发展趋势。  相似文献   

12.
To enable efficiency in stream processing, the evaluation of a query is usually performed over bounded parts of (potentially) unbounded streams, i.e., processing windows “slide” over the streams. To avoid inefficient re-evaluations of already evaluated parts of a stream in respect to a query, incremental evaluation strategies are applied, i.e., the query results are obtained incrementally from the result set of the preceding processing state without having to re-evaluate all input buffers. This method is highly efficient but it comes at the cost of having to maintain processing state, which is not trivial, and may defeat performance advantages of the incremental evaluation strategy. In the context of RDF streams the problem is further aggravated by the hard-to-predict evolution of the structure of RDF graphs over time and the application of sub-optimal implementation approaches, e.g., using relational technologies for storing data and processing states which incur significant performance drawbacks for graph-based query patterns. To address these performance problems, this paper proposes a set of novel operator-aware data structures coupled with incremental evaluation algorithms which outperform the counterparts of relational stream processing systems. This claim is demonstrated through extensive experimental results on both simulated and real datasets.  相似文献   

13.
张译天  于炯  鲁亮  李梓杨 《计算机应用》2019,39(4):1106-1116
新型大数据流式计算框架Apache Heron默认使用轮询调度算法进行任务调度,忽略了拓扑运行时状态以及任务实例间不同通信方式对系统性能的影响。针对这个问题,提出Heron环境下流分类任务调度策略(DSC-Heron),包括流分类算法、流簇分配算法和流分类调度算法。首先通过建立Heron作业模型明确任务实例间不同通信方式的通信开销差异;其次基于流分类模型,根据任务实例间实时数据流大小对数据流进行分类;最后将相互关联的高频数据流整体作为基本调度单元构建任务分配计划,在满足资源约束条件的同时尽可能多地将节点间通信转化为节点内通信以最小化系统通信开销。在包含9个节点的Heron集群环境下分别运行SentenceWordCount、WordCount和FileWordCount拓扑,结果表明DSC-Heron相对于Heron默认调度策略,在系统完成时延、节点间通信开销和系统吞吐量上分别平均优化了8.35%、7.07%和6.83%;在负载均衡性方面,工作节点的CPU占用率和内存占用率标准差分别平均下降了41.44%和41.23%。实验结果表明,DSC-Heron对测试拓扑的运行性能有一定的优化作用,其中对接近真实应用场景的FileWordCount拓扑优化效果最为显著。  相似文献   

14.
Enterprise Communication Systems are designed in such a way to maximise the efficiency of communication and collaboration within the enterprise. With users becoming mobile, the Internet of Things (IoT) can play a crucial role in this process, but is far from being seamlessly integrated into modern online communications. In this paper, we present a semantic infrastructure for gathering, integrating and reasoning upon heterogeneous, distributed and continuously changing data streams by means of semantic technologies and rule-based inference. Our solution exploits semantics to go beyond today’s ad-hoc integration and processing of heterogeneous data sources for static and streaming data. It provides flexible and efficient processing techniques that can transform low-level data into high-level abstractions and actionable knowledge, bridging the gap between IoT and online Enterprise Communication Systems. We document the technologies used for acquisition and semantic enrichment of sensor data, continuous semantic query processing for integration and filtering, as well as stream reasoning for decision support. Our main contributions are the following, (i) we define and deploy a semantic processing pipeline for IoT-enabled Communication Systems, which builds upon existing systems for semantic data acquisition, continuous query processing and stream reasoning, detailing the implementation of each component of our framework; (ii) we present a rich semantic information model for representing and linking IoT data, social data and personal data in the Enterprise Communication scenario, by reusing and extending existing standard semantic models; (iii) we define and develop an expressive stream reasoning component as part of our framework, based on continuous query processing and non-monotonic reasoning for semantic streams, (iv) we conduct experiments to comparatively evaluate the performance of our data acquisition and semantic annotation layer based on OpenIoT, and the performance of our expressive reasoning layer in the scenario of Enterprise Communication.  相似文献   

15.
微博、脸书等社交网络平台涌现的短文本数据流具有海量、高维稀疏、快速可变等特性,使得短文本数据流分类面临着巨大挑战。已有的短文本数据流分类方法难以有效地解决特征高维稀疏问题,并且在处理海量数据流时时间代价较高。基于此,提出一种基于Spark的分布式快速短文本数据流分类方法。一方面,利用外部语料库构建Word2vec词向量模型解决了短文本的高维稀疏问题,并构建扩展词向量库以适应文本的快速可变性,提出一种LR分类器集成模型用于短文本数据流分类,该分类器使用一种FTRL方法实现模型参数的在线更新,并引入时间因子加权机制以适应概念漂移环境;另一方面,所提方法的使用分布式处理提高了海量短文本数据流的处理效率。在3个真实短文本数据流上的实验表明:所提方法在提高分类精度的同时,降低了时间消耗。  相似文献   

16.
Many database applications require efficient processing of data streams with value variations and fluctuant sampling frequency.The variations typically imply fundamental features of the stream and important domain knowledge of underlying objects.In some data streams,successive events seem to recur in a certain time interval,but the data indeed evolves with tiny differences as time elapses.This feature,so called pseudo periodicity,poses a new challenge to stream variation management.This study focuses on the online management for variations over such streams.The idea can be applied to many scenarios such as patient vital signal monitoring in medical applications.This paper proposes a new method named Pattern Growth Graph (PGG) to detect and manage variations over evolving streams with following features:1) adopts the wave-pattern to capture the major information of data evolution and represent them compactly; 2) detects the variations in a single pass over the stream with the help of wave-pattern matching algorithm;3) only stores different segments of the pattern for incoming stream,and hence substantially compresses the data without losing important information;4) distinguishes meaningful data changes from noise and reconstructs the stream with acceptable accuracy. Extensive experiments on real datasets containing millions of data items,as well as a prototype system,are carried out to demonstrate the feasibility and effectiveness of the proposed scheme.  相似文献   

17.
在开放环境下,数据流具有数据高速生成、数据量无限和概念漂移等特性.在数据流分类任务中,利用人工标注产生大量训练数据的方式昂贵且不切实际.包含少量有标记样本和大量无标记样本且还带概念漂移的数据流给机器学习带来了极大挑战.然而,现有研究主要关注有监督的数据流分类,针对带概念漂移的数据流的半监督分类的研究尚未引起足够的重视....  相似文献   

18.
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated based on computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used on both bagging classifiers and base estimators levels, were considered. Experimentation results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-art methods.  相似文献   

19.
In many data stream mining applications, traditional density estimation methods such as kernel density estimation, reduced set density estimation can not be applied to the density estimation of data streams because of their high computational burden, processing time and intensive memory allocation requirement. In order to reduce the time and space complexity, a novel density estimation method Dm-KDE over data streams based on the proposed algorithm m-KDE which can be used to design a KDE estimator with the fixed number of kernel components for a dataset is proposed. In this method, Dm-KDE sequence entries are created by algorithm m-KDE instead of all kernels obtained from other density estimation methods. In order to further reduce the storage space, Dm-KDE sequence entries can be merged by calculating their KL divergences. Finally, the probability density functions over arbitrary time or entire time can be estimated through the obtained estimation model. In contrast to the state-of-the-art algorithm SOMKE, the distinctive advantage of the proposed algorithm Dm-KDE exists in that it can achieve the same accuracy with much less fixed number of kernel components such that it is suitable for the scenarios where higher on-line computation about the kernel density estimation over data streams is required.We compare Dm-KDE with SOMKE and M-kernel in terms of density estimation accuracy and running time for various stationary datasets. We also apply Dm-KDE to evolving data streams. Experimental results illustrate the effectiveness of the proposed method.  相似文献   

20.
王全 《计算机应用》2007,27(10):2372-2375
提出一种能够适应数据流突变式概念变化的增量分类算法,采用网格技术对数据集特征向量进行量化,利用Haar小波多种分辨率的数据表示方式,基于最近邻技术发现测试点的合适类标签。在真实数据集上的测试证明,与已存在的数据流分类算法相比,提出的分类算法精度较高,具有很低的更新代价,适合数据流应用的需求。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号