Most current speech recognizers use an observation space based on a temporal sequence of measurements extracted from fixed-length “frames” (e.g., Mel-cepstra). Given a hypothetical word or sub-word sequence, the acoustic likelihood computation always involves all observation frames, though the mapping between individual frames and internal recognizer states will depend on the hypothesized segmentation. There is another type of recognizer whose observation space is better represented as a network, or graph, where each arc in the graph corresponds to a hypothesized variable-length segment that is represented by a fixed-dimensional “feature”. In such feature-based recognizers, each hypothesized segmentation will correspond to a segment sequence, or path, through the overall segment-graph that is associated with a subset of all possible feature vectors in the total observation space. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. Experiments are reported for both phonetic and word recognition tasks.

A framework for on-demand classification of evolving data streams (total citations: 4; self-citations: 0; citations by others: 4)
Current models of the classification problem do not effectively handle bursts of particular classes arriving at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model treats data stream classification as a dynamic process in which simultaneous training and test streams are used for dynamic classification of data sets. This model reflects real-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aim here is to create a classification system in which the training model can adapt quickly to changes in the underlying data stream. To achieve this goal, we propose an on-demand classification process that can dynamically select the appropriate window of past training data to build the classifier. The empirical results indicate that the system maintains a high classification accuracy in an evolving data stream, while providing an efficient solution to the classification task.

Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different: most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in classification accuracy. On the other hand, this approach suffers from a major drawback, which is the high computational cost of generating new models. The problem worsens when concept drift is detected more frequently, and hence a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called “Info-Fuzzy Network” (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.

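The rapid development of network information technology has produced a new data model, the data stream model, and a growing number of fields now require real-time processing of data streams. The sheer volume and speed of the data, together with the real-time requirements of application scenarios, have driven the development of data stream mining techniques. This paper first introduces common data stream models, then summarizes the supporting techniques for data stream mining according to the characteristics of those models. Finally, it analyzes the importance and effectiveness of distributed data stream mining, gives a mathematical model for algorithm parallelization, and introduces several representative distributed data stream processing systems.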

A survey on algorithms for mining frequent itemsets over data streams (total citations: 9; self-citations: 8; citations by others: 1)
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.

Distributed and Parallel Databases - While the problems of finding the shortest path and k-shortest paths have been extensively researched, the research community has been shifting its focus...

The evaluation of the process of mining associations is an important and challenging problem in database systems and especially those that store critical data and are used for making critical decisions. Within the context of spatial databases we present an evaluation framework in which we use probability distributions to model spatial regions, and Bayesian networks to model the joint probability distribution and the structural relationships among spatial and non-spatial predicates. We demonstrate the applicability of the proposed framework by evaluating representatives from two well-known approaches that are used for learning associations, i.e., dependency analysis (using statistical tests of independence) and Bayesian methods. By controlling the parameters of the framework we provide extensive comparative results of the performance of the two approaches. We obtain measures of recovery of known associations as a function of the number of samples used, the strength, number and type of associations in the model, the number of spatial predicates associated with a particular non-spatial predicate, the prior probabilities of spatial predicates, the conditional probabilities of the non-spatial predicates, the image registration error, and the parameters that control the sensitivity of the methods. In addition to performance we investigate the processing efficiency of the two approaches.

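In recent years, data stream mining has been an active research topic in China and abroad, and frequent itemset mining is an important problem within it. Based on the unbounded and continuously flowing nature of data streams, this paper proposes FIM-SW, an algorithm for mining frequent itemsets in a sliding window. FIM-SW uses a vertical database representation in which each data item is represented by a binary vector, and it uses the Apriori property to generate frequent itemsets. Experimental results show that the algorithm significantly improves mining efficiency.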
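
The abstract above names three ingredients: a sliding window, one binary vector per item, and Apriori-style candidate generation. The following is a minimal sketch of how those pieces could fit together. It is not the FIM-SW algorithm itself, whose details the abstract does not give. The class name `SlidingWindowMiner`, its methods, and the choice of Python integers as bit vectors are all assumptions made for illustration.

```python
from itertools import combinations


class SlidingWindowMiner:
    """Hypothetical sketch: frequent itemsets over the last `window_size` transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.mask = (1 << window_size) - 1  # keeps only the newest window_size bits
        self.bits = {}  # item -> int used as a bit vector; bit 0 is the newest transaction

    def add(self, transaction):
        """Slide the window by one transaction."""
        items = set(transaction)
        # Shift every vector left; the oldest transaction falls off through the mask.
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted or item in items:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs anywhere in the window
        # Mark the items of the new transaction in bit 0.
        for item in items:
            self.bits[item] = self.bits.get(item, 0) | 1

    def frequent_itemsets(self, min_count):
        """Return {itemset (sorted tuple): support count} for the current window."""
        def count(vector):
            return bin(vector).count("1")

        level = {(i,): v for i, v in self.bits.items() if count(v) >= min_count}
        result = {}
        while level:
            result.update({k: count(v) for k, v in level.items()})
            next_level = {}
            for a, b in combinations(sorted(level), 2):
                if a[:-1] != b[:-1]:
                    continue  # join only itemsets that share a common prefix
                candidate = a + (b[-1],)
                # Apriori property: every subset one item smaller must be frequent.
                if all(s in level for s in combinations(candidate, len(a))):
                    vector = level[a] & level[b]  # bitwise AND gives the joint occurrences
                    if count(vector) >= min_count:
                        next_level[candidate] = vector
            level = next_level
        return result
```

In this sketch the support of an itemset is the number of set bits in the AND of its items' vectors, so no transaction is rescanned when the window slides. Items are assumed to be mutually comparable (for example, all strings) so that itemsets can be kept as sorted tuples.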

The goal of data mining is to find interesting and meaningful patterns in large databases. In some real applications, many data are quantitative and linguistic. Fuzzy data mining was thus proposed to discover fuzzy knowledge from this kind of data. In the past, two mining algorithms based on ant colony systems were proposed to find suitable membership functions for fuzzy association rules. They transformed the problem into a multi-stage graph, with each route representing a possible set of membership functions, and then used the ant colony system to solve it. They, however, searched for solutions in a discrete solution space in which the end points of membership functions could be adjusted only in a discrete way. This paper thus extends the original approaches to a continuous search space and proposes a fuzzy mining algorithm based on the continuous ant approach. The end points of the membership functions may be moved in the continuous real-number space. The encoding representation and the operators are also designed to suit the continuous space, such that the actual global optimal solution is contained in the search space. In addition, the proposed approach does not have fixed edges and nodes in the search process. It can dynamically produce search edges according to the distribution functions of pheromones in the solution space. Thus, it can obtain a better near-global-optimal solution than the previous two ant-based fuzzy mining approaches. The experimental results also show the good performance of the proposed approach.

An ACS-based framework for fuzzy data mining (total citations: 1; self-citations: 0; citations by others: 1)
Data mining is often used to find interesting and meaningful patterns in huge databases. It may generate different kinds of knowledge, such as classification rules, clusters, and association rules, among others. Much research on data mining has been proposed, and most of it has focused on mining from binary-valued data. Fuzzy data mining was thus proposed to discover fuzzy knowledge from linguistic or quantitative data. Recently, ant colony systems (ACS) have been successfully applied to optimization problems. However, little work has been done on applying ACS to fuzzy data mining. This thesis thus attempts to propose an ACS-based framework for fuzzy data mining. In the framework, the membership functions are first encoded into binary bits and then fed into the ACS to search for the optimal set of membership functions. The problem is then transformed into a multi-stage graph, with each route representing a possible set of membership functions. When the termination condition is reached, the best membership function set (with the highest fitness value) can then be used to mine fuzzy association rules from a database. Finally, experiments are conducted to compare the proposed framework with other approaches and to show its performance.

In a data streaming setting, data points are observed sequentially. The data generating model may change as the data are streaming. In this paper, we propose detecting this change in data streams by testing the exchangeability property of the observed data. Our martingale approach is an efficient, nonparametric, one-pass algorithm that is effective on the classification, cluster, and regression data generating models. Experimental results show the feasibility and effectiveness of the martingale methodology in detecting changes in the data generating model for time-varying data streams. Moreover, we also show that: 1) An adaptive support vector machine (SVM) utilizing the martingale methodology compares favorably against an adaptive SVM utilizing a sliding window, and 2) a multiple martingale video-shot change detector compares favorably against standard shot-change detection algorithms.
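
To make the idea of testing exchangeability with a martingale more concrete, here is a minimal sketch using a randomized power martingale on a one-dimensional stream. The abstract does not state the paper's strangeness measure, parameters, or thresholds, so everything specific here is an assumption: the function name `detect_change`, the strangeness measure (distance to the running mean), `epsilon=0.92`, and `threshold=20.0`. For readability the sketch recomputes all strangeness values at every step, so unlike the one-pass method described above it takes quadratic time.

```python
import random


def detect_change(stream, epsilon=0.92, threshold=20.0, seed=0):
    """Return the index at which the martingale first exceeds `threshold`, else None.

    Hypothetical sketch: strangeness of a point is its distance to the mean of
    all points seen so far (an assumed, cluster-style measure).
    """
    rng = random.Random(seed)
    seen = []
    total = 0.0
    martingale = 1.0
    for i, x in enumerate(stream):
        seen.append(x)
        total += x
        mean = total / len(seen)
        strangeness = [abs(v - mean) for v in seen]
        newest = strangeness[-1]
        greater = sum(1 for s in strangeness if s > newest)
        equal = sum(1 for s in strangeness if s == newest)  # includes the newest point
        # Randomized p-value: small when the newest point is stranger than most others.
        p = (greater + rng.random() * equal) / len(seen)
        p = max(p, 1e-12)  # guard against a zero p-value
        # Power martingale update: grows when p-values are persistently small.
        martingale *= epsilon * p ** (epsilon - 1.0)
        if martingale > threshold:
            return i
    return None
```

While the data remain exchangeable, the p-values are uniformly distributed and the martingale rarely grows large; by Doob's inequality the chance that it ever exceeds `threshold` is at most `1 / threshold`. After a change, new points tend to be stranger than old ones, the p-values shrink, and the product grows until it crosses the threshold.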

Incremental learning has been used extensively for data stream classification. Most work on data stream classification has focused on non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new algorithm for the classification of batch data, called the harmony-based classifier, and then give its incremental version for the classification of data streams, called the incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in the absence of drifts and to increase its robustness in the presence of noise. This improved version is called the improved incremental harmony-based classifier. The proposed methods are evaluated on several real-world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and that the proposed incremental methods can effectively address the issues usually encountered in data stream environments. The improved incremental harmony-based classifier captures concept drifts with significantly better speed and accuracy than the non-incremental harmony-based method, and its accuracy is comparable to that of non-evolutionary algorithms. The experimental results also show the robustness of the improved incremental harmony-based classifier.

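This paper focuses on the concept drift problem in data stream classification mining and, building on improvements to the CVFDT algorithm, proposes mCVFDT, a multiple-selection decision tree algorithm. The algorithm adds a multiple-attribute selection mechanism to the node structure, which overcomes CVFDT's inability to detect concept drift automatically and avoids repeated traversal of the decision tree, improving both classification accuracy and efficiency. Experimental results show that, as the number of samples increases, the algorithm achieves better classification accuracy than CVFDT.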

A spatio-temporal database manages spatio-temporal objects and supports corresponding query languages. Today, the term moving objects databases is used as a synonym for spatio-temporal databases managing spatial objects with a continuously changing geospatial location and/or extent. Recent advances in wireless communication, miniaturization of spatially enabled devices and global navigation satellite systems (GNSS) services have resulted in a large number of novel application domains. Applications in these novel domains (geo-sensor networks, moving objects tracking, real-time traffic analysis, etc.) process huge volumes of continuous data streams, i.e. data sets that are produced incrementally over time, rather than those available in full before the processing begins. Several data stream management systems (DSMSs) have been developed to manage this data. Since they are mainly based on a relational paradigm, they do not support geospatial data. Therefore, there is an urgent need for geospatial data stream management, ranging from real-time monitoring and alerting to long-term analysis of processed geospatial data. In this paper we present a formal framework consisting of data types and operations needed to support geospatial data in data streams. It can be used as a basis either for implementation of a completely new geospatial DSMS, or for extending available open source products and research prototypes. We leverage the work on abstract data types from spatio-temporal databases, present an implementation based on user-defined aggregate functions and illustrate embedding into an SQL-like language.

Simulation studies often fail to provide useful results because their success depends heavily on the analyst's skill in understanding a system and then correctly identifying all the required data parameters and dependent variables. This paper describes a template-based framework to help identify and specify the components and data parameters for developing models of physical security systems. The layered framework consists of 15 templates built on top of 14 data primitives representing 119 data parameters. The modeling framework has been programmed as an internet-based web application and is simulation language-independent. The usefulness of the framework was tested and shown to have a significant impact on improving the identification of system components and their associated data parameters.

It is challenging to use traditional data mining techniques for real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be close to each other and far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of the proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach is flexible and robust in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.

How can we discover interesting patterns from time-evolving high-speed data streams? How can we analyze the data streams quickly and accurately, with little space overhead? How can we guarantee that the found patterns are self-consistent? High-speed data streams have been receiving increasing attention due to their wide applications such as sensors, network traffic, social networks, etc. The most fundamental task on a data stream is frequent pattern mining; in particular, focusing on recentness is important in real applications. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find top-k recently frequent items in data streams, which is a deterministic version of our motivating algorithm TwSample providing theoretical guarantees based on item sampling. TwMinSwap improves on TwSample in terms of speed, accuracy, and memory usage. Both require only O(k) memory space and do not require any prior knowledge of the stream such as its length and the number of distinct items in the stream. Second, we propose TwMinSwap-Is to find top-k recently frequent itemsets in data streams. We especially focus on keeping self-consistency of the discovered itemsets, which is the most important property for reliable results, while using O(k) memory space with the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in terms of accuracy and memory usage, with fast running time. We also show that TwMinSwap-Is is more accurate than the competitor and discovers recently frequent itemsets with reasonably large sizes (at most 5–7) depending on datasets. Thanks to TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real world data streams, including the difference of trends between the winner and the loser of U.S. presidential candidates, and temporal human contact patterns.
