Similar Documents
20 similar documents found (search time: 31 ms).
1.
The extraction of models from data streams has become a hot topic in data mining due to the proliferation of problems in which data are made available online. This has led to the design of several systems that create data models online. A novel approach to online learning of data streams can be found in Fuzzy-UCS, a young Michigan-style fuzzy-classifier system that has recently been shown to be highly competitive at extracting classification models from complex domains. Despite the promising results reported for Fuzzy-UCS, some open issues still need to be analyzed in detail. This paper carefully studies two key aspects of Fuzzy-UCS: the ability of the system to learn models from data streams in which concepts change over time, and the behavior of different fuzzy representations. Four fuzzy representations that span different trade-offs between flexibility and interpretability are included in the system. The behavior of the different representations on a problem with concept changes is studied and compared with other machine learning techniques designed to deal with these types of problems. Thereafter, the comparison is extended to a large collection of real-world problems, and a close examination of which problem characteristics benefit or hinder the different representations is conducted. The overall results show that Fuzzy-UCS can effectively deal with problems with concept changes and lead to several interesting conclusions about the particular behavior of each representation.

2.
We discuss the problem of capturing media streams which occur during a live lecture in class or during a telepresentation. Instead of presenting yet another method or system for capturing the classroom experience, we introduce some informal guidelines and show their importance for such a system. We derive from these guidelines a formal framework for sets of data streams and an application model to handle these sets so that a real-time replay becomes possible. The Authoring on the Fly system is a possible realization of a framework which follows these guidelines. It allows the capture and real-time replay of data streams captured during a (tele)presentation, including audio, video, and whiteboard action streams. This article gives an overview of the different AoF system components for the various phases of the teaching and learning cycle. It comprises an integrated text and graphics editor for the preparation of pages to be loaded by the whiteboard during the presentation phase. The recording component of the system captures various data streams of the live presentation. They are postprocessed by the system so that they become instances of the class of media for whose replay the general application model was developed. From a global point of view, the Authoring on the Fly system allows one to merge three apparently distinct tasks – teaching in class, telepresentation, and multimedia authoring – into one single activity. The system has been used routinely for recording telepresentations over the MBone net and has already led to a large number of multimedia documents which have been integrated automatically into Web-based teaching and learning environments.

3.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely manner. To detect concept change, a common methodology is to observe the online classification accuracy. If accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, this assumption is often violated in the real world, where data streams carry noise that can also cause a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are ill-suited to data streams: they normally need to scan the data multiple times, whereas learning from data streams can only afford a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model should classify them and update itself according to them remains largely unexplored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise, so it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
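The accuracy-threshold methodology that this abstract critiques can be stated in a few lines. The sketch below (Python) illustrates only that baseline, not FlexDT itself; the window size and threshold are hypothetical values.

    from collections import deque

    class AccuracyDropDetector:
        """Baseline drift detector: flag a concept change when the windowed
        online accuracy falls below a fixed threshold."""
        def __init__(self, window_size=100, threshold=0.7):
            self.window = deque(maxlen=window_size)   # 1 = correct prediction, 0 = error
            self.threshold = threshold

        def update(self, y_true, y_pred):
            self.window.append(1 if y_true == y_pred else 0)
            if len(self.window) < self.window.maxlen:
                return False                          # not enough evidence yet
            accuracy = sum(self.window) / len(self.window)
            return accuracy < self.threshold          # True = drift deemed to have occurred

As the abstract points out, such a detector cannot tell whether an accuracy drop comes from a genuine concept change or simply from noise in the stream.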

4.
A Survey of Load Shedding Techniques in Data Stream Systems (total citations: 2; self-citations: 1; cited by others: 1)
With the rapid spread of data stream applications, managing streaming data poses great challenges to database technology. Because data streams are often bursty and their characteristics may change at any time, a data stream management system must be highly adaptive. When the input rate exceeds the system's processing capacity, the system becomes overloaded and its performance degrades. Load shedding is one effective way to address this problem. When to shed, where to shed, and how much to shed are the three main questions closely tied to load shedding; this paper surveys and analyzes the load shedding techniques adopted by current data stream systems from these three perspectives.
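The three questions named in the abstract (when, where, and how much to shed) can be illustrated with a minimal random load shedder. This is a hedged sketch under the assumption of batch-wise processing; the function name, the "important" predicate, and the capacity parameter are hypothetical and are not taken from any particular surveyed system.

    import random

    def shed_load(batch, capacity, important=lambda t: False):
        """Random load shedding sketch for an overloaded stream operator."""
        if len(batch) <= capacity:                       # when to shed: only under overload
            return batch
        keep = [t for t in batch if important(t)]        # where to shed: spare high-value tuples
        rest = [t for t in batch if not important(t)]
        budget = max(capacity - len(keep), 0)            # how much to shed: down to capacity
        return keep + random.sample(rest, min(budget, len(rest)))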

5.
Classifying dynamic imbalanced data is an important research problem in online learning and class-imbalance learning, dealing with data streams whose class distributions are highly skewed. Such problems are common in real-world scenarios, for example fault diagnosis in real-time control and monitoring systems and intrusion detection in computer networks. Because dynamic data streams exhibit both concept drift and class imbalance, a stream classification algorithm must handle concept drift and resolve class imbalance at the same time. To address these issues, this paper proposes a method that handles imbalanced data while detecting concept drift. The method uses the Kappa coefficient to detect concept drift, further checks the class-balance ratio, and updates the classifier with an imbalanced-data classification method. Experimental results show that, across different evaluation metrics, the algorithm achieves good classification performance on imbalanced data streams.
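A minimal sketch of the mechanism described above: compute Cohen's Kappa on a sliding window of predictions to flag drift, and check the minority-class ratio to decide whether imbalance handling is needed. The thresholds and the combined check are illustrative assumptions, not the paper's exact procedure; integer class labels are assumed.

    import numpy as np

    def kappa(y_true, y_pred):
        """Cohen's Kappa over a window of true labels and predictions."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        labels = np.unique(np.concatenate([y_true, y_pred]))
        p_o = np.mean(y_true == y_pred)                      # observed agreement
        p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
        return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0   # chance-corrected agreement

    def drift_and_imbalance(y_true, y_pred, kappa_threshold=0.3, minority_threshold=0.2):
        """Low Kappa suggests concept drift; a low minority ratio triggers imbalance handling."""
        drift = kappa(y_true, y_pred) < kappa_threshold
        counts = np.bincount(np.asarray(y_true, dtype=int))
        counts = counts[counts > 0]                          # ignore labels absent from the window
        imbalance = counts.min() / counts.sum() < minority_threshold
        return drift, imbalance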

6.
After detecting concept drift, existing drift-handling algorithms typically retrain the classifier on the newly arrived concept and "forget" the previously trained classifiers. In the early stage of a drift, few samples of the new concept are available, so the newly built classifier cannot be sufficiently trained in a short time and its classification performance is usually poor. Furthermore, existing online-transfer-learning data stream classification algorithms can only use the knowledge of a single classifier to assist learning of the new concept; when the historical concept is not very similar to the new one, the accuracy of the classification model is unsatisfactory. To address these problems, this paper proposes CMOL, a data stream classification algorithm that can exploit the knowledge of multiple historical classifiers. CMOL adopts a mechanism that dynamically adjusts classifier weights and updates the classifier pool according to those weights, so that the pool covers as many concepts as possible. Experiments show that, compared with related algorithms, CMOL adapts to new concepts more quickly when concept drift occurs and achieves higher classification accuracy.
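The pool-and-weight mechanism can be sketched as follows: keep several historical classifiers, adjust their weights from recent performance, and evict the weakest member when the pool is full. The reward/decay rule, pool size, and the scikit-learn-style predict interface are assumptions for illustration, not CMOL's exact update rules.

    class WeightedClassifierPool:
        """Fixed-size pool of historical classifiers with dynamically adjusted weights."""
        def __init__(self, max_size=10, decay=0.9):
            self.pool, self.weights = [], []
            self.max_size, self.decay = max_size, decay

        def predict(self, x):
            votes = {}
            for clf, w in zip(self.pool, self.weights):
                label = clf.predict(x)
                votes[label] = votes.get(label, 0.0) + w     # weighted vote across historical concepts
            return max(votes, key=votes.get)

        def update(self, x, y):
            # reward classifiers that were right on the latest instance, decay the others
            self.weights = [w + 1.0 if clf.predict(x) == y else w * self.decay
                            for clf, w in zip(self.pool, self.weights)]

        def add(self, clf):
            if len(self.pool) >= self.max_size:              # evict the lowest-weight member
                drop = self.weights.index(min(self.weights))
                self.pool.pop(drop); self.weights.pop(drop)
            self.pool.append(clf); self.weights.append(1.0)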

7.
Because existing machine learning algorithms are essentially built on a static learning environment and on an optimization process whose goal is to preserve, as far as possible, the generalization ability of the learning system, classifying concept-drifting data streams poses a huge challenge to machine learning. This paper surveys the literature from four angles: data streams and concept drift; the development and trends of research on concept-drifting data stream classification; the main research areas of the field; and recent developments. It also analyzes the problems that remain in current concept-drifting data stream classification algorithms.

8.
The emergence of novel techniques for automatic anomaly detection in surveillance videos has significantly reduced the burden of manual processing of large, continuous video streams. However, existing anomaly detection systems suffer from a high false-positive rate and are not real-time, which makes them of little practical use. Furthermore, their predefined feature selection techniques limit their application to specific cases. To overcome these shortcomings, a dynamic anomaly detection and localization system is proposed, which uses deep learning to automatically learn relevant features. In this technique, each video is represented as a group of cubic patches for identifying local and global anomalies. A sparse denoising autoencoder architecture is used, which significantly reduces the computation time and cuts the number of false positives in frame-level anomaly detection by more than 2.5%. Experimental analysis on two benchmark data sets, the UMN dataset and the UCSD Pedestrian dataset, shows that our algorithm outperforms the state-of-the-art models in terms of false positive rate, while also showing a significant reduction in computation time.

9.
Social networking sites such as Facebook or Twitter attract millions of users, who every day post an enormous amount of content in the form of tweets, comments and posts. Since social network texts are usually short, learning tasks have to deal with a very high-dimensional and sparse feature space, in which most features have low frequencies. As a result, extracting useful knowledge from such noisy data is a challenging task that turns large-scale short-text learning in social environments into one of the most relevant problems in machine learning and data mining. Feature selection is one of the best-known and most commonly used techniques for reducing the impact of the high-dimensional feature space in text learning. A wide variety of feature selection techniques can be found in the literature applied to traditional long texts and document collections. However, short texts coming from the social Web pose new challenges to this well-studied problem: their shortness offers only a limited context from which to extract statistical evidence about word relations (e.g. correlation), and instances usually arrive in continuous streams (e.g. the Twitter timeline), so the number of features and instances is unknown, among other problems. This paper surveys feature selection techniques for dealing with short texts in both offline and online settings. Then, open issues and research opportunities for performing online feature selection over social media data are discussed.

10.
The number of Internet of Things devices generating data streams is expected to grow exponentially with the support of emergent technologies such as 5G networks. Therefore, the online processing of these data streams requires the design and development of suitable machine learning algorithms, able to learn online as data is generated. Like their batch-learning counterparts, stream-based learning algorithms require careful hyperparameter settings. However, this problem is exacerbated in online learning settings, especially with the occurrence of concept drifts, which frequently require the reconfiguration of hyperparameters. In this article, we present SSPT, an extension of the Self Parameter Tuning (SPT) optimisation algorithm for data streams. We apply the Nelder–Mead algorithm to dynamically sized samples, converging to optimal settings in a single pass over the data while using a relatively small number of hyperparameter configurations. In addition, our proposal automatically readjusts hyperparameters when concept drift occurs. To assess the effectiveness of SSPT, the algorithm is evaluated on three different machine learning problems: recommendation, regression, and classification. Experiments with well-known data sets show that the proposed algorithm can outperform previous hyperparameter tuning efforts by human experts. Results also show that SSPT converges significantly faster and presents at least similar accuracy when compared with the previous double-pass version of the SPT algorithm.
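A hedged sketch of the core loop: draw a sample from the stream, score a hyperparameter vector with a prequential (test-then-train) loss, and let Nelder–Mead search the space. The model interface (partial_fit/predict), the loss, and the option values are assumptions; SSPT's dynamic sample sizing and drift-triggered re-tuning are not reproduced here.

    import numpy as np
    from scipy.optimize import minimize

    def tune_on_sample(build_model, sample_X, sample_y, x0):
        """Tune numeric hyperparameters on a stream sample with Nelder-Mead."""
        def loss(params):
            model = build_model(*params)                 # fresh incremental model per configuration
            errors, seen = 0, 0
            for x, y in zip(sample_X, sample_y):
                if seen:                                 # prequential: test first ...
                    errors += int(model.predict([x])[0] != y)
                model.partial_fit([x], [y])              # ... then train on the same instance
                seen += 1
            return errors / max(seen - 1, 1)
        result = minimize(loss, x0=np.asarray(x0, dtype=float), method="Nelder-Mead",
                          options={"maxiter": 30, "xatol": 1e-2, "fatol": 1e-2})
        return result.x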

11.
Concept drift and class imbalance in complex data streams degrade classifier performance. Traditional batch learning algorithms must account for memory and running time and do not perform well on massive, fast-arriving data streams, which moreover contain a great deal of drift and class imbalance; using online ensemble algorithms to handle complex data streams has therefore become an important research topic in data mining. This paper introduces and summarizes the online versions of the bagging, boosting, and stacking ensemble strategies and compares the performance of the different models. It provides the first detailed summary and analysis of online ensemble classification algorithms for complex data streams: drift detection and classification algorithms are reviewed from the perspectives of active detection and passive adaptation, imbalanced streams are reviewed from the perspectives of data preprocessing and cost-sensitive learning, the time and space efficiency of representative algorithms is analyzed, and algorithms evaluated on the same data sets are compared. Finally, future research directions are proposed for the challenges in online ensemble classification of complex data streams.
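One representative method covered by such surveys is online bagging (Oza and Russell), in which the bootstrap resampling of batch bagging is approximated by presenting each arriving instance to each base learner k ~ Poisson(1) times. The sketch below assumes base learners with a partial_fit/predict interface and omits the drift and imbalance extensions discussed in the survey.

    import numpy as np

    class OnlineBagging:
        """Simplified online bagging: Poisson(1) replaces bootstrap resampling."""
        def __init__(self, base_learners, rng=None):
            self.learners = base_learners                # incremental models with partial_fit/predict
            self.rng = rng or np.random.default_rng()

        def partial_fit(self, x, y):
            for learner in self.learners:
                k = self.rng.poisson(1.0)                # how many times this learner sees the instance
                for _ in range(k):
                    learner.partial_fit([x], [y])

        def predict(self, x):
            votes = [learner.predict([x])[0] for learner in self.learners]
            return max(set(votes), key=votes.count)      # unweighted majority vote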

12.
This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, "subsampling", is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, "logical windowing", is a multiple-sample design that repeatedly tests a partially correct theory and sequentially adds the errors it makes to the training sample. Both schemes are derived from techniques developed to enable propositional learning methods (like decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets: one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask the following questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to the one obtained if all the data were used (that is, with no sampling)? and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answer to both questions is "yes". This suggests that an ILP program equipped with an appropriate sampling method could begin to address satisfactorily problems that have hitherto been inaccessible simply because of their size.
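The "subsampling" design can be illustrated with a short sketch: the utility of a candidate rule is estimated on a random subsample of the positive and negative examples rather than on the whole pool. The coverage predicate, the utility measure, and the sample size are hypothetical placeholders, not CProgol's actual evaluation function.

    import random

    def estimate_rule_utility(rule_covers, positives, negatives, sample_size=500, seed=0):
        """Score a candidate rule on a random subsample of the example pool."""
        rng = random.Random(seed)
        pos = rng.sample(positives, min(sample_size, len(positives)))
        neg = rng.sample(negatives, min(sample_size, len(negatives)))
        tp = sum(rule_covers(e) for e in pos)            # positives the rule covers
        fp = sum(rule_covers(e) for e in neg)            # negatives the rule wrongly covers
        return tp - fp                                   # simple utility estimate on the sample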

13.
One-class learning and concept summarization for data streams (total citations: 2; self-citations: 2; cited by others: 0)
In this paper, we formulate a new research problem of concept learning and summarization for one-class data streams. The main objectives are to (1) allow users to label instance groups, instead of single instances, as positive samples for learning, and (2) summarize the concepts labeled by users over the whole stream. Batch labeling raises serious issues for stream-oriented concept learning and summarization, because a labeled instance group may contain non-positive samples and users may change their labeling interests at any time. As a result, the positive samples labeled by users over the whole stream may be inconsistent and contain multiple concepts. To resolve these issues, we propose a one-class learning and summarization (OCLS) framework with two major components. In the first component, we propose a vague one-class learning (VOCL) module for concept learning from data streams, using an ensemble of classifiers with instance-level and classifier-level weighting strategies. In the second component, we propose a one-class concept summarization (OCCS) module that uses clustering techniques and a Markov model to summarize the concepts labeled by users, with only one scan of the stream data. Experimental results on synthetic and real-world data streams demonstrate that the proposed VOCL module outperforms its peers in learning concepts from vaguely labeled stream data. The OCCS module is also able to rebuild a high-level summary of the concepts marked by users over the stream.
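The two weighting levels mentioned in the abstract can be sketched as follows: instance-level weights down-weight members of a labeled group that look unlike the group as a whole (likely non-positive samples), and classifier-level weights combine the scores of the ensemble members. The centroid-distance heuristic and the score_sample method name are illustrative assumptions, not VOCL's actual formulas.

    import numpy as np

    def instance_weights(group):
        """Instance-level weighting: instances far from the labeled group's centroid
        are suspected non-positive samples and receive smaller weights."""
        X = np.asarray(group, dtype=float)
        dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
        return np.exp(-dist / (dist.mean() + 1e-12))     # weights in (0, 1]

    def ensemble_score(x, classifiers, clf_weights):
        """Classifier-level weighting: weighted average of one-class scores."""
        scores = np.array([clf.score_sample(x) for clf in classifiers])
        w = np.asarray(clf_weights, dtype=float)
        return float(np.dot(w, scores) / w.sum())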

14.
Bioinformatics aims at applying computer science methods to the wealth of data collected in a variety of experiments in the life sciences (e.g. cell and molecular biology, biochemistry, medicine, etc.) in order to help analyse such data and elicit new knowledge from it. In addition to string processing, bioinformatics is often identified with machine learning, used for mining the large banks of bio-data available in electronic format, notably on a number of web servers. Nevertheless, there are opportunities to apply other computational techniques in some bioinformatics applications. In this paper, we report the application of constraint programming to two structural bioinformatics problems: protein structure prediction and protein interaction (docking). The efficient application of constraint programming requires innovative modelling of these problems, as well as the development of advanced propagation techniques (e.g. global reasoning and propagation), which were adopted in Chemera, a system that is currently used to support biochemists in their research.

15.
Transfer learning exploits the abundant data of a source domain to help build accurate models for a target domain. Feature-based transfer learning is a widely studied class of transfer learning techniques, but existing feature transfer methods face the following problems: some methods can only perform linear feature transfer and thus have limited transfer ability; others achieve nonlinear feature transfer but usually require strategies such as the kernel trick, which makes the transfer process hard to interpret. To address this, fuzzy inference is introduced and a feature transfer method based on uncertain inference rules is proposed. The method carries out feature transfer with a fuzzy inference system and uses manifold regularization to avoid information loss during the transfer. Because fuzzy systems combine strong nonlinear modelling ability with good rule-based interpretability, the proposed method achieves good nonlinear feature transfer and its new features are easy to understand. Extensive experiments show that the algorithm clearly outperforms many existing methods on cross-domain image classification.

16.
Statistical Learning for Humanoid Robots (total citations: 7; self-citations: 0; cited by others: 0)
The complexity of the kinematic and dynamic structure of humanoid robots makes conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design when insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for real-time learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even high-dimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: learning inverse dynamics models for model-based control, learning the inverse kinematics of redundant manipulators, and learning oculomotor reflexes. All these examples demonstrate fast (within seconds or minutes) learning convergence with highly accurate final performance. We conclude that real-time learning for complex motor systems like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future.
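The locally weighted idea behind LWPR can be conveyed with the classical batch form of locally weighted regression: fit a ridge-regularised linear model around each query point with Gaussian kernel weights. This is only a didactic sketch; LWPR itself is incremental and builds local projections, which are not reproduced here.

    import numpy as np

    def locally_weighted_prediction(X, y, x_query, bandwidth=1.0, ridge=1e-6):
        """Predict y at x_query by weighted least squares around the query point."""
        X, y, xq = np.asarray(X, float), np.asarray(y, float), np.asarray(x_query, float)
        w = np.exp(-np.sum((X - xq) ** 2, axis=1) / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
        Xb = np.hstack([X, np.ones((len(X), 1))])                             # add a bias column
        A = Xb.T @ (w[:, None] * Xb) + ridge * np.eye(Xb.shape[1])            # ridge term for numerical stability
        beta = np.linalg.solve(A, Xb.T @ (w * y))
        return float(np.append(xq, 1.0) @ beta)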

17.
An OLAP-based Model for Teaching Diagnosis and Evaluation (total citations: 11; self-citations: 0; cited by others: 0)
王陆  李亚文 《计算机工程》2003,29(5):49-50,194
Building on the web-based teaching support platform of the Capital Normal University virtual learning community, this paper proposes a teaching diagnosis and evaluation model based on online analytical processing (OLAP). Using the DMQL language, it defines a data cube with four dimensions (student, knowledge point, time, and cognitive skill) and presents a data mining solution that analyses this cube with OLAP operations such as roll-up, drill-down, dice, and slice. The proposed OLAP-based data mining scheme can answer questions in teaching diagnosis and evaluation such as identifying learners' difficulties, understanding the learning characteristics of groups or individuals, and tracing learners' cognitive processes, thereby supporting personalized, adaptive teaching and student-centred autonomous learning.
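The cube operations named in the abstract (roll-up, drill-down, slice, dice) can be imitated in a few lines with grouped aggregation. The records and column names below are hypothetical, and pandas stands in for the OLAP/DMQL tooling actually used in the paper.

    import pandas as pd

    # Hypothetical learning-log records over the four dimensions named in the abstract.
    records = pd.DataFrame({
        "student":         ["s1", "s1", "s2", "s2"],
        "knowledge_point": ["kp1", "kp2", "kp1", "kp2"],
        "month":           ["2003-01", "2003-01", "2003-02", "2003-02"],
        "skill":           ["recall", "apply", "recall", "apply"],
        "score":           [80, 55, 90, 60],
    })

    # Base cuboid: the full data cube at the finest granularity.
    cube = records.groupby(["student", "knowledge_point", "month", "skill"])["score"].mean()
    rollup = records.groupby("knowledge_point")["score"].mean()              # roll-up: aggregate away other dimensions
    slice_jan = records[records["month"] == "2003-01"]                       # slice: fix the time dimension
    dice = records[(records["skill"] == "apply") & (records["score"] < 60)]  # dice: sub-cube flagging learning difficulties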

18.
Data mining techniques are traditionally divided into two distinct disciplines depending on the task to be performed by the algorithm: supervised learning and unsupervised learning. While the former aims at making accurate predictions after assuming an underlying structure in the data (which requires the presence of a teacher during the learning phase), the latter aims at discovering regularly occurring patterns beneath the data without making any a priori assumptions about their underlying structure. A purely supervised model can construct a very accurate predictive model from data streams. However, in many real-world problems this paradigm may be ill-suited due to (1) the dearth of training examples and (2) the cost of labeling the information required to train the system. A sound use case for this concern arises when defining data replication and partitioning policies to store the data that emerges in the Smart Grids domain, in order to adapt electric networks to current application demands (e.g., real-time consumption, network self-adaptation). As opposed to classic electrical architectures, Smart Grids encompass a fully distributed scheme with several diverse data generation sources. Current data storage and replication systems fail both at coping with such an overwhelming amount of heterogeneous data and at satisfying the stringent requirements posed by this technology (i.e., the dynamic nature of the physical resources, the continuous flow of information and the demand for autonomous behavior). The purpose of this paper is to apply unsupervised learning techniques to enhance the performance of data storage in Smart Grids. More specifically, we have improved the eXtended Classifier System for Clustering (XCSc) algorithm to present a hybrid system that mixes data replication and partitioning policies by means of an online clustering approach. The experiments conducted show that the proposed system outperforms previous proposals and truly fits the Smart Grid premises.

19.
Vetter, R.J.; Du, D.H.C. 《Computer》, 1993, 26(2): 8-18
An environment that uses wavelength division multiplexing techniques and optical switching and processing to provide large bandwidths, short delays, and multiple data streams for distributed processing is described. The focus is on the interrelationship between application needs and network services. The system level, a conceptual layer designed to bridge the gap between application requirements and underlying high-speed network services, is proposed. The system level is a logical view of the physical network, represented by a virtual topology projected onto the physical network. Embedding this virtual topology introduces many new problems and performance trade-offs into the design of the network. A few of these problems are outlined, and some initial research efforts in this area are discussed. The physical network level, the collection of optical fiber links interconnecting the nodes in the network, and the application level, a logical view of an application's computational topology and a representation of the application's communication and computing requirements, are also described.

20.
Process monitoring and diagnosis have been widely recognized as important and critical tools in system monitoring for the detection of abnormal behavior and for quality improvement. Although traditional statistical process control (SPC) tools are effective in simple manufacturing processes that generate a small volume of independent data, these tools are not capable of handling the large streams of multivariate and autocorrelated data found in modern systems. As the limitations of SPC methodology become increasingly obvious in the face of ever more complex processes, data mining algorithms, because of their proven ability to effectively analyze and manage large amounts of data, have the potential to resolve the challenging problems that are stretching SPC to its limits. In the present study we attempted to integrate state-of-the-art data mining algorithms with SPC techniques to achieve efficient monitoring in multivariate and autocorrelated processes. The data mining algorithms include artificial neural networks, support vector regression, and multivariate adaptive regression splines. The residuals of the data mining models were utilized to construct multivariate cumulative sum (CUSUM) control charts to monitor the process mean. Simulation results from various scenarios indicate that data mining model-based control charts perform better than traditional time-series model-based control charts.
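A hedged sketch of the monitoring step: once a data mining model has been fitted to in-control data, its residuals are fed to a multivariate CUSUM chart (here in Crosier's form). The reference value k and decision limit h are illustrative; in practice they would be chosen to meet a target in-control run length.

    import numpy as np

    def mcusum_on_residuals(residuals, k=0.5, h=5.0):
        """Crosier-style multivariate CUSUM on model residuals; returns the chart statistic."""
        R = np.asarray(residuals, dtype=float)
        mu = R.mean(axis=0)                               # in-control residual mean (ideally near zero)
        Sinv = np.linalg.inv(np.cov(R, rowvar=False))     # inverse residual covariance
        s, stats = np.zeros(R.shape[1]), []
        for r in R:
            d = s + (r - mu)
            c = float(np.sqrt(d @ Sinv @ d))
            s = np.zeros_like(s) if c <= k else d * (1.0 - k / c)   # shrink toward zero by the reference value
            stats.append(float(np.sqrt(s @ Sinv @ s)))    # signal a mean shift when this exceeds h
        return np.array(stats), h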

