Similar Documents (20 records found)
1.
In this paper, we present new algorithms for dynamically computing quantiles of a relation subject to insert as well as delete operations. At the core of our algorithms lies a small-space multiresolution representation of the underlying data distribution based on random subset sums, or RSSs. These RSSs are updated with every insert and delete operation. When quantiles are demanded, we use the RSSs to estimate quickly, without accessing the data, all the quantiles, each guaranteed to be accurate to within a user-specified precision. While quantiles have found many uses in databases, our focus in this paper is primarily on network management applications that monitor the distribution of active sessions in the network. Our examples are drawn from both telephony and IP networks, where the goal is to monitor the distribution of the lengths of active calls and IP flows, respectively, over time. For such applications, we propose a new type of histogram that uses RSSs to summarize the dynamic parts of the distributions, while other parts with a small volume of sessions are approximated using simple counters.
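The insert/delete/quantile-query interface described above can be illustrated with a toy multiresolution summary. The sketch below keeps exact counts over dyadic intervals of a small fixed universe and answers a quantile query by descending the dyadic tree; the real RSS technique replaces these exact counts with small-space random-subset-sum estimates, so the class and method names here (`DyadicQuantileSketch`, `quantile`) are illustrative assumptions, not the paper's implementation.

```python
class DyadicQuantileSketch:
    """Toy multiresolution summary over a fixed universe [0, 2**levels).

    Maintains one count per dyadic interval at every resolution level;
    a real RSS sketch would store small-space randomized estimates of
    these counts instead of the counts themselves.
    """

    def __init__(self, levels=4):
        self.levels = levels
        # counts[l][i] = number of items in the i-th dyadic interval at level l
        self.counts = [[0] * (1 << l) for l in range(levels + 1)]

    def _update(self, x, delta):
        for l in range(self.levels + 1):
            self.counts[l][x >> (self.levels - l)] += delta

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def quantile(self, phi):
        """Smallest value whose cumulative count reaches phi * total."""
        target = phi * self.counts[0][0]
        node = 0
        for l in range(1, self.levels + 1):
            left = self.counts[l][2 * node]  # count in the left child
            if target <= left:
                node = 2 * node
            else:
                target -= left
                node = 2 * node + 1
        return node
```

Both `insert` and `delete` touch one interval per level, so updates cost O(levels); a query is a single root-to-leaf descent.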

2.
3.
Quantile regression is a widespread regression technique that allows modeling the entire conditional distribution of the response variable. A natural extension to censored observations has been introduced using a reweighting scheme based on the Kaplan-Meier estimator. The same ideas can be applied to depth quantiles, leading to regression quantiles for censored data that are robust to outliers in both the predictors and the response variable. For their computation, a fast algorithm over a grid of quantile values is proposed. The robustness of the method is shown in a simulation study and on two real data examples.
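The Kaplan-Meier reweighting idea can be sketched concretely: censored observations receive weight zero, and each uncensored observation is upweighted by the inverse of the Kaplan-Meier estimate of the censoring survival function, after which quantiles are computed as weighted quantiles. The function names below are illustrative, the implementation assumes distinct observation times, and the abstract's depth-quantile regression itself is not reproduced here.

```python
def km_censoring_survival(times, deltas):
    """Kaplan-Meier estimate of the censoring survival G(t-).

    Treats censorings (delta == 0) as the 'events'; assumes distinct
    times for simplicity. Returns G evaluated just before each time.
    """
    pairs = sorted(zip(times, deltas))
    n, G, out = len(pairs), 1.0, {}
    for i, (t, d) in enumerate(pairs):
        out[t] = G                      # value of G just before t
        if d == 0:                      # a censoring 'event'
            G *= 1 - 1 / (n - i)        # n - i subjects still at risk
    return out

def km_weights(times, deltas):
    """Inverse-probability-of-censoring weights: delta_i / G(T_i-)."""
    G = km_censoring_survival(times, deltas)
    return [d / G[t] if d else 0.0 for t, d in zip(times, deltas)]

def weighted_quantile(values, weights, tau):
    """Smallest value whose cumulative weight reaches tau of the total."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, cum = sum(weights), 0.0
    for i in order:
        cum += weights[i]
        if cum >= tau * total:
            return values[i]
```

With no censoring, every weight is 1 and the weighted quantile reduces to the ordinary sample quantile, which is the sanity check one would expect of such a scheme.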

4.
5.
6.
This paper proposes a new method for estimating extreme quantiles of heavy-tailed distributions from massive data. The method combines the peaks-over-threshold (POT) approach with the generalized Pareto distribution (GPD), commonly used to estimate extreme quantiles, and estimates the GPD parameters using the empirical distribution function (EDF) and nonlinear least squares (NLS). We first estimate the parameters of the GPD using the EDF and NLS, and then estimate multiple high quantiles of the massive data from the observations above a threshold using conventional POT. Simulation results demonstrate that our parameter estimation method has a smaller mean squared error (MSE) than other common methods when the shape parameter of the GPD is at least 0. The estimated quantiles also show the best performance in terms of root MSE (RMSE) and absolute relative bias (ARB) for heavy-tailed distributions.
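The two-step recipe above (fit the GPD to the EDF of the exceedances, then plug the fitted parameters into the POT quantile formula) can be sketched as follows. A crude grid search stands in for the paper's nonlinear least squares, and all function names and grid choices are illustrative assumptions.

```python
import random

def gpd_cdf(y, xi, sigma):
    """GPD cdf F(y) = 1 - (1 + xi*y/sigma)**(-1/xi), for xi != 0."""
    return 1 - (1 + xi * y / sigma) ** (-1 / xi)

def fit_gpd_edf(exceedances, xi_grid, sigma_grid):
    """Grid-search stand-in for the NLS fit of the GPD cdf to the EDF."""
    y = sorted(exceedances)
    n = len(y)
    edf = [(i + 1) / (n + 1) for i in range(n)]   # plotting positions
    best = None
    for xi in xi_grid:
        for sigma in sigma_grid:
            sse = sum((gpd_cdf(v, xi, sigma) - p) ** 2
                      for v, p in zip(y, edf))
            if best is None or sse < best[0]:
                best = (sse, xi, sigma)
    return best[1], best[2]

def pot_quantile(u, xi, sigma, p, n_total, n_exceed):
    """Standard POT estimator of the p-th quantile above threshold u."""
    return u + sigma / xi * (((n_total / n_exceed) * (1 - p)) ** (-xi) - 1)
```

For ξ = 0.5, σ = 1 and threshold 0, the formula gives the exact 0.99 quantile 2·(0.01^(-0.5) − 1) = 18, which the sketch reproduces.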

7.
Graphics processing units (GPUs) have an SIMD architecture and have recently been widely used as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing, because the most frequent operation in data cube computation is aggregation, an expensive operation well suited to SIMD parallel processors. The H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update, and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree-based data cube operations, employing techniques such as parallel primitives for segmented data and efficient memory access patterns to achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms often achieve more than an order of magnitude speedup over their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.
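The "parallel primitives for segmented data" mentioned above are reductions and scans that operate independently within each segment of a flat array (e.g., one segment per H-tree node). The sequential sketch below only illustrates what those primitives compute; on a GPU each would be a single data-parallel pass, and the function names are illustrative.

```python
def segmented_sum(values, seg_ids):
    """Sum values sharing a segment id (segmented reduction)."""
    out = {}
    for v, s in zip(values, seg_ids):
        out[s] = out.get(s, 0) + v
    return out

def segmented_inclusive_scan(values, flags):
    """Inclusive prefix sums restarted at each segment boundary.

    flags[i] == 1 marks the start of a new segment.
    """
    out, run = [], 0
    for v, f in zip(values, flags):
        run = v if f else run + v
        out.append(run)
    return out
```

Expressing aggregation through such primitives is what lets the cube computation keep every GPU lane busy regardless of how unevenly cells are populated.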

8.
Many real-world applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions for the efficient use of a particular method remains an open research problem. It is also worthwhile to study the nature of imbalanced data, the characteristics of the minority class distribution, and their influence on classification performance. However, current studies of imbalanced data difficulty factors have mainly been done with artificial datasets, and their conclusions are not easily applicable to real-world problems, in part because the methods for identifying these factors are not sufficiently developed. In this paper, we capture the difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare, and outliers. First, we confirm their occurrence in real data by exploring multidimensional visualizations of selected datasets. Then, we introduce a method for identifying these types of examples based on analyzing the class distribution in the local neighbourhood of the considered example. Two ways of modeling this neighbourhood are presented: with k nearest examples and with kernel functions. Experiments with artificial datasets show that these methods are able to rediscover simulated types of examples. Further contributions of this paper include a comprehensive experimental study with 26 real-world imbalanced datasets, where (1) we identify new data characteristics based on the analysis of the types of minority examples, and (2) we demonstrate that this analysis allows us to differentiate the classification performance of popular classifiers and pre-processing methods and to evaluate their areas of competence. Finally, we highlight directions for exploiting the results of our analysis in developing new algorithms for learning classifiers and pre-processing methods.
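The k-nearest-neighbour variant of the neighbourhood analysis can be sketched directly: each minority example is typed by how many of its k = 5 nearest neighbours are also minority. The 4-5 / 2-3 / 1 / 0 thresholds below are a common mapping for k = 5, assumed here for illustration; the paper's kernel-based variant is not reproduced.

```python
import math

def label_minority_types(X, y, minority=1, k=5):
    """Type each minority example from the class mix of its k nearest
    neighbours (Euclidean distance): safe / borderline / rare / outlier.
    """
    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # distances to every other example, nearest first
        dists = sorted((math.dist(xi, xj), j)
                       for j, xj in enumerate(X) if j != i)
        m = sum(1 for _, j in dists[:k] if y[j] == minority)
        labels[i] = ("safe" if m >= 4 else
                     "borderline" if m >= 2 else
                     "rare" if m == 1 else "outlier")
    return labels
```

A minority point inside a homogeneous minority cluster comes out "safe", while a lone minority point surrounded by majority examples comes out "outlier", matching the intuition behind the four types.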

9.
10.
11.
Sensory evaluation has been widely applied in different industrial fields, especially for quality inspection, product design, and marketing. Classically, factorial multivariate methods have been the only tools for analyzing and modeling sensory data provided by experts, panelists, or consumers. These methods are efficient for some problems but sometimes cause a loss of important information. In this situation, new methods based on intelligent techniques such as fuzzy logic, neural networks, data aggregation, classification, and clustering have been applied to handle the uncertainty and imprecision inherent in sensory evaluation. These new methods can be used together with the classical ones in a complementary way to obtain relevant information from sensory data. This paper outlines the general background of sensory evaluation and the corresponding industrial interests, and explicitly indicates some directions for further development by IT researchers.

12.
13.
王立杰, 李萌, 蔡斯博, 李戈, 谢冰, 杨芙清. 《软件学报》 (Journal of Software), 2012, 23(6): 1335-1349
With the maturation and growth of Web service technology, a large number of public Web services have appeared on the Internet. When Web services are used to develop software systems, their textual descriptions (e.g., overviews and usage instructions) help service consumers identify, understand, and use them effectively. Most existing work focuses on obtaining such information from a Web service's WSDL file for service discovery or retrieval; however, our survey found that the WSDL files of most Web services on the Internet contain little or no such information. We therefore propose a Web-search-based method that enriches Web services with textual descriptions drawn from sources beyond their WSDL files. The method collects from the Internet web pages containing identifying features of a target Web service, extracts candidate text fragments from those pages, uses information retrieval techniques to score each fragment's relevance to the target service, and selects the highest-scoring fragments as the service's supplementary textual description. Experiments on real data from the Internet show that relevant web pages can be retrieved for about 51% of the Web services on the Internet, and that textual descriptions can be supplied for about 88% of those services. The collected Web services and their textual descriptions have been publicly released.
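The relevance-scoring step described above can be sketched with a standard TF-IDF cosine-similarity ranker over tokenised text: the target service's terms form one document, and each extracted web-page fragment is scored against it. This is a generic information-retrieval baseline assumed for illustration; the paper's actual relevance computation may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict term -> weight) per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Ranking fragments by `cosine` against the service document and keeping the top scorers mirrors the "select the fragments with the highest relevance" step of the method.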

14.
With genome sequencing completed for most organisms, gene identification has become especially important. CpG islands carry significant biological meaning within a genome, so identifying them aids gene identification. Most existing models for locating CpG islands suffer from label bias or require independence assumptions. This paper therefore proposes a new method for locating CpG islands based on conditional random fields (CRFs). The method casts CpG island localization as a sequence labeling problem and, based on the positional properties of CpG islands, designs corresponding algorithms for model construction, training, and decoding. Given an input sequence, the algorithm determines the most likely label sequence and thereby locates the CpG islands. Tests on a standard database show that the algorithm is feasible and efficient, achieving higher accuracy than HMM-based methods.
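For orientation, the classical non-CRF baseline for this task is a sliding-window screen on GC content and the observed/expected CpG ratio. The sketch below implements that simple baseline, not the paper's CRF; the window size and thresholds are illustrative assumptions (published criteria typically use much longer windows, GC ≥ 0.5 and O/E ≥ 0.6).

```python
def cpg_windows(seq, win=10, gc_min=0.5, oe_min=0.6):
    """Flag window start positions whose GC content and observed/expected
    CpG ratio both exceed the given thresholds (classical screen, not a CRF).
    """
    hits = []
    for i in range(len(seq) - win + 1):
        w = seq[i:i + win]
        c, g = w.count("C"), w.count("G")
        gc = (c + g) / win
        cpg = sum(1 for j in range(win - 1) if w[j:j + 2] == "CG")
        expected = c * g / win                 # expected CpG count
        oe = cpg / expected if expected else 0.0
        if gc >= gc_min and oe >= oe_min:
            hits.append(i)
    return hits
```

The CRF formulation replaces these hard per-window thresholds with learned feature weights and a globally decoded label sequence, which is what avoids the label bias and independence assumptions criticised above.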

15.
16.
BIRCH: A New Data Clustering Algorithm and Its Applications
Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles), so as the dataset size increases, they do not scale up well in terms of memory requirements, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: building an iterative and interactive pixel classification tool, and generating the initial codebook for image compression.
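The building block of the CF-tree is the clustering feature: a triple (N, LS, SS) of point count, per-dimension linear sum, and sum of squared norms. Two CFs merge by component-wise addition, and centroid and radius follow directly from the triple, which is what makes the summary incrementally maintainable. The sketch below shows the triple itself, not the tree's insertion or rebuilding logic; class and method names are illustrative.

```python
import math

class CF:
    """BIRCH clustering feature: (N, LS, SS) summary of a set of points."""

    def __init__(self, point=None, dim=2):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = 0.0           # sum of squared norms
        if point is not None:
            self.add(point)

    def add(self, p):
        """Absorb a single point into the summary."""
        self.n += 1
        for i, v in enumerate(p):
            self.ls[i] += v
        self.ss += sum(v * v for v in p)

    def merge(self, other):
        """Merge another CF: simple component-wise addition."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root-mean-square distance of the points to the centroid."""
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

Because merging is just addition, a CF-tree node can summarize millions of points in constant space, which is the memory property the abstract emphasizes.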

17.
Mining constrained gradients in large databases
Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for those factors. To systematically analyze data in an MDS, an interesting notion called the "cubegrade" was recently introduced by Imielinski et al. [2002]; it focuses on notable changes in measures in an MDS by comparing a cell (which we refer to as the probe cell) with its gradient cells, namely its ancestors, descendants, and siblings. We call such queries gradient analysis queries (GQs). Since an MDS can contain billions of cells, it is important to answer GQs efficiently. We focus on developing efficient methods for mining GQs constrained by certain (weakly) antimonotone constraints. Instead of conducting an independent gradient-cell search once per probe cell, which is inefficient due to much repeated work, we propose an efficient algorithm, LiveSet-Driven, that finds all good gradient-probe cell pairs in one search pass. It utilizes measure-value analysis and dimension-match analysis in a set-oriented manner to achieve bidirectional pruning between the sets of hopeful probe cells and hopeful gradient cells. Moreover, it adopts a hypertree structure and an H-cubing method to compress data and maximize sharing of computation. Our performance study shows that this algorithm is efficient and scalable. In addition to data cubes, we extend our study to another important scenario: mining constrained gradients in transactional databases where each item is associated with measures such as price. Such transactional databases can be viewed as sparse MDSs where items represent dimensions, although they have significantly different characteristics from data cubes. We outline efficient mining methods for this problem.
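The probe-cell-versus-gradient-cell comparison can be sketched on a tiny cube stored as a dictionary from dimension-value tuples (with `"*"` for an aggregated dimension) to an aggregate measure such as average sales. The brute-force sketch below compares each probe cell only against its ancestors and reports pairs whose measure ratio exceeds a threshold; it deliberately omits the LiveSet-Driven pruning, and all names and the ratio constraint are illustrative assumptions.

```python
from itertools import combinations

def gradient_pairs(cube, min_ratio=2.0):
    """Report (probe cell, ancestor cell, ratio) triples whose measure
    ratio is at least min_ratio. An ancestor is obtained by replacing
    one or more concrete dimension values with '*'.
    """
    results = []
    for cell, m in cube.items():
        concrete = [i for i, v in enumerate(cell) if v != "*"]
        for r in range(1, len(concrete) + 1):
            for drop in combinations(concrete, r):
                anc = tuple("*" if i in drop else v
                            for i, v in enumerate(cell))
                if anc in cube and cube[anc] > 0:
                    ratio = m / cube[anc]
                    if ratio >= min_ratio:
                        results.append((cell, anc, ratio))
    return results
```

A probe cell with 2^d ancestors already shows why independent per-probe searches repeat work; LiveSet-Driven's contribution is pruning both candidate sets in a single pass instead.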

18.
High-throughput experiments have become more and more prevalent in biomedical research, and the resulting high-dimensional data bring new challenges. Effective data reduction, summarization, and visualization are key to initial exploration in data mining. In this paper, we introduce a visualization tool, the quantile map, to present the information contained in a probability distribution. We demonstrate its use as an effective visual analysis tool through an application to a tandem mass spectrometry data set. Quantile information of a distribution is presented in gradient colors by concentric doughnuts, with doughnut widths proportional to the Fisher information of the distribution so as to give an unbiased visual impression. A parametric empirical Bayes (PEB) approach is shown to improve on the simple maximum likelihood estimate (MLE) when estimating the Fisher information. In the motivating example from tandem mass spectrometry data, multiple probability distributions are displayed in a two-dimensional grid. Hierarchical clustering to reorder rows and columns, and a gradient color selection from a Hue-Chroma-Luminance model similar to that commonly applied in heatmaps of microarray analysis, are adopted to improve the visualization. Both simulations and the motivating example show the superior performance of the quantile map in summarizing and visualizing such high-throughput data sets.
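The two quantities a quantile map encodes, quantile bands for the color gradient and Fisher information for the doughnut width, can be computed with a short sketch. The exponential model, the plug-in Fisher information n/λ² at the MLE, and the function names are all illustrative assumptions; the paper's PEB refinement of the MLE is not reproduced here.

```python
def decile_bands(sample):
    """Empirical decile boundaries: the values that would delimit the
    gradient-coloured concentric doughnuts of a quantile map.
    """
    xs = sorted(sample)
    n = len(xs)
    return [xs[int(q * (n - 1) / 10)] for q in range(11)]

def exp_fisher_info(sample):
    """Plug-in Fisher information of an exponential model at the MLE
    lambda_hat = 1/mean, i.e. n / lambda_hat**2; the doughnut width is
    drawn proportional to (a function of) this quantity.
    """
    lam = 1.0 / (sum(sample) / len(sample))
    return len(sample) / lam ** 2
```

A sharply estimated distribution (large Fisher information) thus gets a visibly wider doughnut, which is the "unbiased visual impression" the abstract refers to.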

19.
Energy efficiency is recognized as a critical problem in wireless networks. Many routing schemes have been proposed for finding energy-efficient routing paths with a view to extending network lifetime; however, it has been observed that the energy-efficient path depletes quickly. Further, an unbalanced distribution of energy among the nodes may cause premature death of nodes as well as of the network. Hence, balancing the energy distribution is a challenging area of research in wireless networks. In this paper, we propose an energy-efficient scheme that assigns each node a cost for relaying data packets to the sink. The node cost accounts for both the remaining energy of the node and its energy efficiency. Using this parameter, we propose an energy-efficient routing algorithm that balances the data traffic among the nodes and prolongs the network lifetime. Simulations show that the proposed routing scheme improves energy efficiency and network lifetime compared with widely used methods, viz. Shortest Path Tree (SPT), the Minimum Spanning Tree (MST) based PEDAP, Distributed Energy Balanced Routing (DEBR), and the Shortest Path Aggregation Tree based routing protocol.
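The node-cost idea, combining remaining energy with energy efficiency, can be sketched as a next-hop selection rule. The specific combination below (residual energy divided by per-packet transmission cost) is an illustrative assumption, not the paper's exact formula, and the function names are hypothetical.

```python
def node_cost(residual_energy, tx_cost):
    """Illustrative score: favour neighbours with plenty of remaining
    energy AND a cheap (energy-efficient) link to them. The paper's
    actual node-cost formula may combine these differently.
    """
    return residual_energy / tx_cost

def pick_next_hop(neighbors):
    """Choose the relay with the highest node cost.

    neighbors: list of (node_id, residual_energy, tx_cost) tuples.
    """
    return max(neighbors, key=lambda n: node_cost(n[1], n[2]))[0]
```

Because the score falls as a node's battery drains, traffic naturally shifts away from depleted relays, which is the balancing behaviour the abstract describes.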

20.
In this paper, we propose a new four-parameter distribution with increasing, decreasing, bathtub-shaped, and unimodal failure rates, called the exponentiated Weibull-Poisson (EWP) distribution. The new distribution arises from a latent complementary risk problem and is obtained by compounding the exponentiated Weibull (EW) and Poisson distributions. It contains several lifetime sub-models, such as the generalized exponential-Poisson (GEP), complementary Weibull-Poisson (CWP), complementary exponential-Poisson (CEP), exponentiated Rayleigh-Poisson (ERP), and Rayleigh-Poisson (RP) distributions. We obtain several properties of the new distribution, including its probability density function, its reliability and failure rate functions, quantiles, and moments. A maximum likelihood estimation procedure via an EM algorithm is presented, and the sub-models of the EWP distribution are studied in detail. Finally, applications to two real data sets demonstrate the flexibility and potential of the new distribution.
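The complementary-risk construction can be sketched by simulation: draw N from a zero-truncated Poisson, draw N exponentiated Weibull lifetimes, and take their maximum. The EW cdf is assumed here in the common parameterisation F(x) = (1 - exp(-(λx)^γ))^α, and all function names are illustrative; the paper's exact parameterisation may differ.

```python
import math, random

def rexp_weibull(alpha, gamma, lam):
    """Inverse-cdf draw from the exponentiated Weibull distribution
    F(x) = (1 - exp(-(lam*x)**gamma))**alpha.
    """
    u = random.random()
    return (-math.log(1 - u ** (1 / alpha))) ** (1 / gamma) / lam

def rpois_truncated(theta):
    """Zero-truncated Poisson(theta) draw by cdf inversion + rejection."""
    while True:
        n, p, u = 0, math.exp(-theta), random.random()
        cum = p
        while u > cum:
            n += 1
            p *= theta / n
            cum += p
        if n >= 1:
            return n

def rewp(alpha, gamma, lam, theta):
    """Complementary-risk draw from the EWP distribution: the maximum
    of N EW lifetimes, with N zero-truncated Poisson(theta).
    """
    n = rpois_truncated(theta)
    return max(rexp_weibull(alpha, gamma, lam) for _ in range(n))
```

With α = γ = 1 the EW component reduces to an exponential(λ) lifetime, which gives a convenient sanity check on the inverse-cdf sampler (its median should be ln 2 / λ).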
