期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compressed histograms with arbitrary bucket layouts for selectivity estimation

Dennis Fuchs Zhen He Byung Suk Lee 《Information Sciences》2007,177(3):680-702

Selectivity estimation is an important step of query optimization in a database management system, and multi-dimensional histogram techniques have proved promising for selectivity estimation. Recent multi-dimensional histogram techniques such as GenHist and STHoles use an arbitrary bucket layout. This layout has the advantage of requiring a smaller number of buckets to model tuple densities than those required by the traditional grid or recursive layouts. However, the arbitrary bucket layout brings an inherent disadvantage of requiring more memory to store each bucket location information. This diminishes the advantage of requiring fewer buckets and, therefore, has an adverse effect on the resulting selectivity estimation accuracy. To our knowledge, however, no existing histogram-based technique with arbitrary layout addresses this issue. In this paper, we introduce the idea of bucket location compression and then demonstrate its effectiveness for improving selectivity estimation accuracy by proposing the STHoles+ technique. STHoles+ extends STHoles by quantizing each coordinate of a bucket relative to the coordinate of the smallest enclosing bucket. This quantization increases the number of histogram buckets that can be stored in the histogram. Our quantization scheme allows STHoles+ to trade precision of histogram bucket locations for storing more buckets. Experimental results show that STHoles+ outperforms STHoles on various data distributions, query distributions, and other factors such as available memory size, quantization resolution, and dimensionality of the data space. 相似文献

2.

Compressed hierarchical binary histograms for summarizing multi-dimensional data

Filippo Furfaro Giuseppe M. Mazzeo Domenico Saccà Cristina Sirangelo 《Knowledge and Information Systems》2008,15(3):335-380

Hierarchical binary partitions of multi-dimensional data are investigated as a basis for the construction of effective histograms. Specifically, the impact of adopting lossless compression techniques for representing the histogram on both the accuracy and the efficiency of query answering is investigated. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then introducing further restrictions on the partitioning which enable a more compact representation of bucket boundaries. Basically, these restrictions consist of constraining the splits of the partition to be laid onto regular grids defined on the buckets. Several heuristics guiding the histogram construction are also proposed, and a thorough experimental analysis comparing the accuracy of histograms resulting from combining different heuristics with different representation models (both the new compression-based and the traditional ones) is provided. The best accuracy turns out from combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques, showing that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme results in improving the efficiency of query estimation. 相似文献

3.

Analysis of linear probing with buckets

Haim Mendelson 《Information Systems》1983,8(3):207-216

Consider a random-access file which consists of a given number of buckets. Each bucket contains a fixed number of slots. Storage and retrieval of records are performed using linear probing. The probabilities underlying the behavior of this addressing system are determined, and the relevant performance measures are derived. 相似文献

4.

限定误差的直方图 总被引：6，自引：0，他引：6

吴胜利《计算机学报》1998,21(8):704-712

限定误差的直方图方法以任意给定的误差作为前提，生成满足要求的直方图。本文在作者已有工作的基础上，进一步探讨生成限定误差直方图的方法，以适用于等值和范围两种查询，并进行了大量的模拟实验。实验结果表明，限定误差的直方图不仅估算精确，肯定能满足预定的要求，而且算法简单，实现效率高，具有良好的应用前景。相似文献

5.

Block matching algorithm for motion estimation based on Artificial Bee Colony (ABC)

Erik Cuevas Daniel Zaldívar Marco Pérez-Cisneros Humberto Sossa Valentín Osuna 《Applied Soft Computing》2013,13(6):3047-3059

Block matching (BM) motion estimation plays a very important role in video coding. In a BM approach, image frames in a video sequence are divided into blocks. For each block in the current frame, the best matching block is identified inside a region of the previous frame, aiming to minimize the sum of absolute differences (SAD). Unfortunately, the SAD evaluation is computationally expensive and represents the most consuming operation in the BM process. Therefore, BM motion estimation can be approached as an optimization problem, where the goal is to find the best matching block within a search space. The simplest available BM method is the full search algorithm (FSA) which finds the most accurate motion vector through an exhaustive computation of SAD values for all elements of the search window. Recently, several fast BM algorithms have been proposed to reduce the number of SAD operations by calculating only a fixed subset of search locations at the price of poor accuracy. In this paper, a new algorithm based on Artificial Bee Colony (ABC) optimization is proposed to reduce the number of search locations in the BM process. In our algorithm, the computation of search locations is drastically reduced by considering a fitness calculation strategy which indicates when it is feasible to calculate or only estimate new search locations. Since the proposed algorithm does not consider any fixed search pattern or any other movement assumption as most of other BM approaches do, a high probability for finding the true minimum (accurate motion vector) is expected. Conducted simulations show that the proposed method achieves the best balance over other fast BM algorithms, in terms of both estimation accuracy and computational cost. 相似文献

6.

Modeling splits in file structures

Ricardo A. Baeza-Yates 《Acta Informatica》1989,26(4):349-362

Summary We analyze the expected behaviour of file structures where splits are used to handle overflows. Two cases are analyzed. The first model is of a file with an index on top of the data structure. We analyze the effect of unbalanced splits, and the effect of splitting in more than two buckets. The second model is of an ideal hash file, in which the probability of insertion remains the same for every bucket, regardless of how many times the bucket has been split. The result is an upper bound in any dynamic hashing method that uses splitting and does not allow overflow records. In both cases, the effect of using partial expansions is included.This work was also supported by a scholarship from the Institute for Computer Research and by the University of Chile, Santiago, Chile 相似文献

7.

On the File Design Problem for Partial Match Retrieval

《IEEE transactions on pattern analysis and machine intelligence》1985,(2):213-222

相似文献

8.

Data space mapping for efficient I/O in large multi-dimensional databases

Hakan Ferhatosmanoglu Aravind Ramachandran Divyakant Agrawal Amr El Abbadi 《Information Systems》2007

In this paper, we propose data space mapping techniques for storage and retrieval in multi-dimensional databases on multi-disk architectures. We identify the important factors for an efficient multi-disk searching of multi-dimensional data and develop secondary storage organization and retrieval techniques that directly address these factors. We especially focus on high dimensional data, where none of the current approaches are effective. In contrast to the current declustering techniques, storage techniques in this paper consider both inter- and intra-disk organization of the data. The data space is first partitioned into buckets, then the buckets are declustered to multiple disks while they are clustered in each disk. The queries are executed through bucket identification techniques that locate the pages. One of the partitioning techniques we discuss is especially practical for high dimensional data, and our disk and page allocation techniques are optimal with respect to number of I/O accesses and seek times. We provide experimental results that support our claims on two real high dimensional datasets. 相似文献

9.

Optimization heuristics for determining internal rating grading scales

M. Lyra S. Paterlini P. Winker 《Computational statistics & data analysis》2010,54(11):2693-2706

Basel II imposes regulatory capital on banks related to the default risk of their credit portfolio. Banks using an internal rating approach compute the regulatory capital from pooled probabilities of default. These pooled probabilities can be calculated by clustering credit borrowers into different buckets and computing the mean PD for each bucket. The clustering problem can become very complex when Basel II regulations and real-world constraints are taken into account. Search heuristics have already proven remarkable performance in tackling this problem. A Threshold Accepting algorithm is proposed, which exploits the inherent discrete nature of the clustering problem. This algorithm is found to outperform alternative methodologies already proposed in the literature, such as standard k-means and Differential Evolution. Besides considering several clustering objectives for a given number of buckets, we extend the analysis further by introducing new methods to determine the optimal number of buckets in which to cluster banks’ clients. 相似文献

10.

基于MIT规则的自适应扩展集员估计方法 总被引：2，自引：0，他引：2

宋大雷吴冲齐俊桐韩建达《自动化学报》2012,38(11):1847-1860

用于非线性椭球估计的自适应扩展集员(Adaptive extended set-membership filter, AESMF)算法在实际应用中存在着过程噪声设定椭球与真实噪声椭球失配的问题, 导致滤波器的估计出现偏差甚至发散. 本文提出了一种基于MIT规则过程噪声椭球最优化的自适应扩展集员估计算法(MIT-AESMF), 用于解决非线性系统时变状态和参数的联合估计和定界中过程噪声无法精确建模问题的新算法. 本算法通过MIT优化规则,在线计算使一步预测偏差包络椭球最小化的过程噪声包络椭球, 以此保证滤波器健康指标满足有效条件; 最后, 采用地面移动机器人状态和动力学参数联合估计验证了所提出方法的有效性. 相似文献

11.

Efficient construction of histograms for multidimensional data using quad-trees

Yohan J. Roh Jae Ho Kim 《Decision Support Systems》2011,52(1):82-94

Histograms can be useful in estimating the selectivity of queries in areas such as database query optimization and data exploration. In this paper, we propose a new histogram method for multidimensional data, called the Q-Histogram, based on the use of the quad-tree, which is a popular index structure for multidimensional data sets. The use of the compact representation of the target data obtainable from the quad-tree allows a fast construction of a histogram with the minimum number of scanning, i.e., only one scanning, of the underlying data. In addition to the advantage of computation time, the proposed method also provides a better performance than other existing methods with respect to the quality of selectivity estimation. We present a new measure of data skew for a histogram bucket, called the weighted bucket skew. Then, we provide an effective technique for skew-tolerant organization of histograms. Finally, we compare the accuracy and efficiency of the proposed method with other existing methods using both real-life data sets and synthetic data sets. The results of experiments show that the proposed method generally provides a better performance than other existing methods in terms of accuracy as well as computational efficiency. 相似文献

12.

Locality preserving hashing for fast image search: theory and applications

Maojing Niu Liang Wu Junwei Zeng 《人工智能实验与理论杂志》2013,25(2):349-359

Abstract

State-of-the-art hashing methods, such as the kernelised locality-sensitive hashing and spectral hashing, have high algorithmic complexities to build the hash codes and tables. Our observation from the existing hashing method is that, putting two dissimilar data points into the same hash bucket only reduces the efficiency of the hash table, but it does not hurt the query accuracy. Whereas putting two similar data points into different hash buckets will reduce the correctness (i.e. query accuracy) of a hashing method. Therefore, it is much more important for a good hashing method to ensure that similar data points have high probabilities to be put to the same bucket, than considering those dissimilar data-point relations. On the other side, attracting similar data points to the same hash bucket will naturally suppress dissimilar data points to be put into the same hash bucket. With this locality-preserving observation, we naturally propose a new hashing method called the locality-preserving hashing, which builds the hash codes and tables with much lower algorithmic complexity. Experimental results show that the proposed method is very competitive in terms of the training time spent for large data-sets among the state of the arts, and with reasonable or even better query accuracy. 相似文献

13.

多核平台并行单源最短路径算法 总被引：1，自引：0，他引：1

下载免费PDF全文

黄跃峰钟耳顺《计算机工程》2012,38(3):1-3

提出一种多核平台并行单源最短路径算法。采用与Δ-Stepping算法相似的并行策略,通过多个子线程对同一个桶中的弧段进行并行松弛,利用主线程控制串行搜索中桶的序列。实验结果表明,该算法求解全美单源最短路径的时间约为4 s,与使用相同代码实现的串行算法相比,加速比更高。相似文献

14.

多维数值型敏感属性数据的个性化隐私保护方法

张梅舒徐雅斌《计算机应用》2020,40(2):491-496

为了解决多维数值型敏感属性数据隐私保护方法中存在的准标识符属性信息损失大,以及不能满足用户对数值型敏感属性重要性排序的个性化需求问题,提出一种基于聚类和加权多维桶分组（MSB）的个性化隐私保护方法。首先,根据准标识符的相似程度,将数据集划分成若干准标识符属性值相近的子集;然后,考虑到用户对敏感属性的敏感程度不同,将敏感程度和多维桶的桶容量用于计算加权选择度和构建加权多维桶;最后,依此对数据进行分组和匿名化处理。选用UCI的标准Adult数据集中的8个属性进行实验,并与基于聚类和多维桶的数据隐私保护方法MNSACM和基于聚类和加权多维桶分组的个性化隐私保护方法WMNSAPM进行对比。实验结果表明,所提方法整体较优,并且在减少信息损失和运行时间方面明显优于对比方法,提高了数据质量和运行效率。相似文献

15.

Bucket attack on numeric set watermarking model and safeguards

Gaurav Gupta Josef Pieprzyk Len Hamey 《Information Security Technical Report》2011,16(2):59-66

相似文献

16.

Complexity analysis for partitioning nearest neighbor searchingalgorithms

Zakarauskas P. Ozard J.M. 《IEEE transactions on pattern analysis and machine intelligence》1996,18(6):663-668

Presents cost estimates for finding the k-nearest neighbors to a test pattern according to a Minkowski p-metric, as a function of the size of the buckets in partitioning searching algorithms. The asymptotic expected number of operations to find the nearest neighbor is presented as a function of the average number of patterns per bucket n and is shown to contain a global minimum 相似文献

17.

A novel sequential watermark detection model for efficient traceback of secret network attack flows

Xiaogang Wang Ming Yang Junzhou Luo 《Journal of Network and Computer Applications》2013,36(6):1660-1670

Network watermarking schemes have been proposed to trace secret network attack flows transferred through stepping stones as well as anonymous channels. However, most existing network flow watermark detection techniques focus on a fixed sample size of network data to achieve the required accuracy. Irrespective of the uncertainty or information content of successive observations, such detection techniques will result in low efficiency of watermark detection. We herein propose a novel sequential watermark detection model (SWDM) supporting three sequential detectors for efficient traceback of network attack flows. By exploiting the sequential probability ratio test approach, we first propose the intuitive paired-intervals-based optimum watermark detector (POWD) and the single-interval-based optimum watermark detector (SOWD) under the assumption of known parameters of the observed attack flow. We then propose the sequential sign watermark detector (SSWD) that operates on two-level quantized observations for nonparametric watermark detection. Based on our SWDM model, a statistical analysis of sequential detectors, with no assumptions or limitations concerning the distribution of the timing of packets, proves their effectiveness despite traffic timing perturbations. The experiments using a large number of synthetically-generated SSH traffic flows demonstrate that there is a significant advantage in using our sequential watermark detectors based on the proposed SWDM model over the existing fixed sample size watermark detector (FSWD). Compared to the FSWD detector, the POWD detector achieves almost 28% savings in the average number of packets. Especially, given the required probability of detection errors, the SOWD detector and the SSWD detector can achieve almost 47% and 29% savings, respectively, in the average number of required packets, thus resulting in not only guaranteed rates of detection errors but also high efficiency of flow traceback. 相似文献

18.

A multiple-level visual secret-sharing scheme without image size expansion 总被引：1，自引：0，他引：1

Yung-Fu Chen Ching-Chun Huang 《Information Sciences》2007,177(21):4696-4710

In traditional VSS schemes, the size of the share image is substantially expanded since each pixel of the secret image is mapped onto a block consisting of several pixels. In addition, the quality of the reconstructed secret image is normally degraded in contrast, especially for halftone images. This study proposes a VSS scheme that maps a block in a secret image onto one corresponding equal-sized block in each share image without image size expansion. Two types of techniques, including histogram width-equalization and histogram depth-equalization, are proposed to generate the corresponding share blocks containing multiple levels rather than two levels based on the density of black pixels on the blocks for a secret block. In the former technique, the gray-scale image histogram is obtained by uniformly splitting the range of the pixel gray levels in the secret image, while in the latter the buckets are created so that the area of each bucket is roughly constant by containing approximately the same number of pixels. The proposed schemes significantly improve the quality of the reconstructed secret image compared to several previous investigations. 相似文献

19.

The eternal sunshine of the sketch data structure

《Computer Networks》2008,52(17):3248-3257

In the past years there has been significant research on developing compact data structures for summarizing large data streams. A family of such data structures is the so-called sketches. Sketches bear similarities to the well-known Bloom filters [B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of ACM, 13 (7) (1970), 422–426] and employ hashing techniques to approximate the count associated with an arbitrary key in a data stream using fixed memory resources. One limitation of sketches is that when used for summarizing long data streams, they gradually saturate, resulting in a potentially large error on estimated key counts. In this work, we introduce two techniques to address this problem based on the observation that real-world data streams often have many transient keys that appear for short time periods and do not re-appear later on. After entering the data structure, these keys contribute to hashing collisions and thus reduce the estimation accuracy of sketches. Our techniques use a limited amount of additional memory to detect transient keys and to periodically remove their hashed values from the sketch. In this manner the number of keys hashed into a sketch decreases, and as a result the frequency of hashing collisions and the estimation error are reduced. Our first technique in effect slows down the saturation process of a sketch, whereas our second technique completely prevents a sketch from saturating.¹ We demonstrate the performance improvements of our techniques analytically as well as experimentally. Our evaluation results using real network traffic traces show a reduction in the collision rate ranging between 26.1% and 98.2% and even higher savings in terms of estimation accuracy compared to a state-of-the-art sketch data structure. To our knowledge this is the first work to look into the problem of improving the accuracy of sketches by mitigating their saturation process. 相似文献

20.

Histograms based on the minimum description length principle

Hai Wang Kenneth C. Sevcik 《The VLDB Journal The International Journal on Very Large Data Bases》2008,17(3):419-442

相似文献