Similar Documents
20 similar documents retrieved (search time: 27 ms).
1.
Similarity search is important in information retrieval applications where objects are usually represented as vectors of high dimensionality. This leads to an increasing need to support the indexing of high-dimensional data. On the other hand, indexing structures based on space partitioning are ineffective because of the well-known “curse of dimensionality”. Linear scan of the data with approximation is more efficient for high-dimensional similarity search. However, approaches so far have concentrated on reducing I/O and ignored the computation cost. For an expensive distance function such as the Lp norm with fractional p, the computation cost becomes the bottleneck. We propose a new technique that addresses expensive distance functions by “indexing the function”: some key values of the function are pre-computed once. These values are then used to derive upper and lower bounds on the distance between a data vector and the query vector. The technique is extremely efficient since it avoids most of the distance function computations; moreover, it does not involve any extra secondary storage because no index is constructed and stored. The efficiency is confirmed by cost analysis, as well as by experiments on synthetic and real data.
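A minimal sketch of the general bounding idea described above (not the paper's exact scheme): pre-compute the fractional power function on a fixed grid once, then use table lookups to get cheap lower/upper bounds on the Lp distance and skip the exact computation when a candidate cannot improve the current result. The grid resolution, the assumption that data lie in [0, 1], and the pruning threshold are illustrative choices.

```python
import numpy as np

P = 0.5                                       # fractional p of the L_p norm (assumed)
GRID = np.linspace(0.0, 1.0, 1025)            # key values at which the function is "indexed"
POWERS = GRID ** P                            # pre-computed once

def lp_bounds(x, q):
    """Cheap lower/upper bounds on sum_i |x_i - q_i|^p via table lookup."""
    d = np.abs(x - q)                         # assumes data scaled into [0, 1]
    idx = np.searchsorted(GRID, d)            # first grid point >= each |d|
    lo = POWERS[np.maximum(idx - 1, 0)].sum()             # round every |d| down
    hi = POWERS[np.minimum(idx, len(GRID) - 1)].sum()     # round every |d| up
    return lo, hi

def prune_or_compute(x, q, best_so_far):
    """Only pay for the expensive fractional power when the bounds cannot prune."""
    lo, _hi = lp_bounds(x, q)
    if lo > best_so_far:                      # cannot beat the current candidate
        return None
    return float(np.sum(np.abs(x - q) ** P))  # exact (expensive) distance

rng = np.random.default_rng(0)
x, q = rng.random(64), rng.random(64)
print(prune_or_compute(x, q, best_so_far=20.0))
```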

2.
Tables are two-dimensional arrays given in row-major order. Such data have unique features that could be exploited for effective compression. For example, tables often represent database files with rows as records so certain columns or fields in a table may have few distinct values. This means that simply transposing the data can make it compress better. Further, a large source of information redundancy in a table is the correlation among columns representing related types of data. This paper formalizes the notion of column dependency as a way to capture this information redundancy across columns and discusses how to automatically compute and use it to substantially improve table compression.
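A toy illustration of the transposition observation only (the paper's column-dependency analysis for grouping correlated columns is not reproduced): columns with few distinct values become long runs when stored contiguously, which a generic compressor exploits.

```python
import zlib

# Synthetic table: date, city, currency, amount-bucket columns with few distinct values.
rows = [("2024-01-0%d" % (i % 9 + 1), "NY", "USD", str(100 + i % 3))
        for i in range(10000)]

row_major = "\n".join("|".join(r) for r in rows).encode()
col_major = "\n".join("|".join(col) for col in zip(*rows)).encode()  # transposed

print("row-major   :", len(zlib.compress(row_major)))
print("column-major:", len(zlib.compress(col_major)))   # typically much smaller
```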

3.
Exploiting the specific characteristics of statistical queries, the records of the items that need to be aggregated together are clustered into the same blocks during file organization; at the same time, very large statistical tables are split into several smaller tables according to certain rules, and indexes are built on the small tables. This not only reduces the number of physical disk I/O operations but also reduces scan time, achieving the optimization goal. The approach improves the efficiency of statistical queries and has been applied successfully in a real project.
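A hedged sketch of the splitting-and-indexing idea using SQLite: one huge statistics table is split into smaller per-period tables and the columns used by the statistical queries are indexed, so a query only scans blocks of the relevant small table. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stats_2024_01 (item_id INTEGER, amount REAL)")
con.execute("CREATE TABLE stats_2024_02 (item_id INTEGER, amount REAL)")
con.execute("CREATE INDEX idx_2024_01_item ON stats_2024_01(item_id)")
con.execute("CREATE INDEX idx_2024_02_item ON stats_2024_02(item_id)")
con.executemany("INSERT INTO stats_2024_01 VALUES (?, ?)",
                [(i % 50, float(i)) for i in range(1000)])

# Only the small table for the requested period (and its index) is touched.
total, = con.execute(
    "SELECT SUM(amount) FROM stats_2024_01 WHERE item_id = ?", (7,)).fetchone()
print(total)
```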

4.
Estimating the selectivity of multidimensional range queries over real valued attributes has significant applications in data exploration and database query optimization. In this paper, we consider the following problem: given a table of d attributes whose domain is the real numbers and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. The simplest approach to tackle this problem is to assume that the attributes are independent. More accurate estimators try to capture the joint data distribution of the attributes. In databases, such estimators include the construction of multidimensional histograms, random sampling, or the wavelet transform. In statistics, kernel estimation techniques are being used. Many traditional approaches assume that attribute values come from discrete, finite domains, where different values have high frequencies. However, for many novel applications (as in temporal, spatial, and multimedia databases) attribute values come from the infinite domain of real numbers. Consequently, each value appears very infrequently, a characteristic that affects the behavior and effectiveness of the estimator. Moreover, real-life data exhibit attribute correlations that also affect the estimator. We present a new histogram technique that is designed to approximate the density of multidimensional datasets with real attributes. Our technique defines buckets of variable size and allows the buckets to overlap. The size of the cells is based on the local density of the data. The use of overlapping buckets allows a more compact approximation of the data distribution. We also show how to generalize kernel density estimators and how to apply them to the multidimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with existing techniques using real and synthetic datasets. The experimental results show that the proposed techniques behave more accurately in high dimensionalities than previous approaches.
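A hedged sketch of a plain multidimensional kernel selectivity estimator, i.e. the statistics idea the paper builds on rather than its exact histogram or generalized-kernel construction: place a product Gaussian kernel on each sampled record and integrate each kernel analytically over the query box. The sample size and bandwidth below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def kernel_selectivity(sample, lows, highs, bandwidth):
    """Estimate P(lows <= X <= highs) from an (m, d) sample of the table."""
    # per-kernel, per-dimension probability mass inside [low, high]
    mass = norm.cdf((highs - sample) / bandwidth) - \
           norm.cdf((lows - sample) / bandwidth)
    return float(np.mean(np.prod(mass, axis=1)))     # average over the kernels

rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100_000)
sample = data[rng.choice(len(data), 500, replace=False)]   # small sample of records

lows, highs = np.array([-0.5, -0.5]), np.array([1.0, 1.0])
est = kernel_selectivity(sample, lows, highs, bandwidth=0.3)
true = np.mean(np.all((data >= lows) & (data <= highs), axis=1))
print(f"estimated {est:.3f} vs. true {true:.3f}")
```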

5.
Existing key-value storage systems lack hotspot awareness, which makes them perform poorly and unreliably under highly skewed workloads. To address this, an adaptive hotspot-aware hash index model is proposed; the model implements a high-performance hash table based on key digest information. First, the digest of a key is used in place of the full key, compressing the key storage space and optimizing the bucket data structure of the hash table. Second, CPU data-level parallelism and the CPU cache line are exploited to optimize the probe operation of the hash table. Finally, because digests do not allow exact key comparison and may incur extra disk I/O, an adaptive key scheduling algorithm is designed that dynamically adjusts where key values are stored according to the currently available memory, the load on the hash index, and access hotspots. Experiments on YCSB benchmark datasets show that, compared with state-of-the-art hash tables, the adaptive hotspot-aware hash index improves speed by up to 1.2× at the same memory utilization.
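A minimal sketch of the digest idea only (not the paper's SIMD or cache-line layout, and with invented bucket structures): each bucket keeps 1-byte key digests; a probe first scans the digests and only compares the full key on a digest match, so most misses cost no full key comparison.

```python
import hashlib

NUM_BUCKETS = 1024
# each bucket holds two parallel arrays: 1-byte tags and full (key, value) slots
buckets = [{"tags": [], "slots": []} for _ in range(NUM_BUCKETS)]

def digest(key: bytes) -> int:
    return hashlib.blake2b(key, digest_size=1).digest()[0]

def put(key: bytes, value):
    b = buckets[hash(key) % NUM_BUCKETS]
    b["tags"].append(digest(key))
    b["slots"].append((key, value))

def get(key: bytes):
    b = buckets[hash(key) % NUM_BUCKETS]
    tag = digest(key)
    for i, t in enumerate(b["tags"]):             # cheap scan over 1-byte tags
        if t == tag and b["slots"][i][0] == key:  # full compare only on a tag hit
            return b["slots"][i][1]
    return None

put(b"user:42", "alice")
print(get(b"user:42"), get(b"user:7"))
```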

6.
In this paper, we propose a graph-based cluster-sequencing method to minimize the I/O cost in spatial join processing. We first define the maximum overlapping (MO) order in a graph, proving that the problem of finding an MO order in a graph is NP-complete. Then, we propose an algorithm to find an approximation to MO order in a graph. We also prove that the approximation to MO order obtained from our method is close to the optimal result. Simulations have been conducted to demonstrate the saving of I/O cost in spatial join by using our method.
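A hedged sketch of the flavor of a maximum-overlapping order: a simple greedy heuristic that sequences clusters so consecutive clusters share as many objects as possible. This is not the paper's approximation algorithm and carries none of its quality guarantee; the cluster contents are toy data.

```python
def greedy_overlap_order(clusters):
    """clusters: dict name -> set of object/page ids; returns a visit order."""
    remaining = dict(clusters)
    current = max(remaining, key=lambda c: len(remaining[c]))  # arbitrary start: largest cluster
    order = [current]
    del remaining[current]
    while remaining:
        prev = clusters[order[-1]]
        nxt = max(remaining, key=lambda c: len(prev & remaining[c]))  # max overlap with predecessor
        order.append(nxt)
        del remaining[nxt]
    return order

clusters = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {9, 10},
    "D": {5, 6, 9},
}
print(greedy_overlap_order(clusters))   # ['A', 'B', 'D', 'C']
```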

7.
Modelling systems with hysteresis has received considerable attention recently due to increasingly demanding accuracy requirements in engineering applications. The classical Preisach model (CPM) is the most popular model of hysteresis and can be represented by an infinite but countable set of first-order reversal curves (FORCs). Using look-up tables is one way to realize the CPM in practice; the data in those tables correspond to samples of a finite number of FORCs. This approach, however, faces two major problems: first, it requires a large amount of memory space to obtain an accurate prediction of hysteresis; second, it is difficult to derive efficient ways to modify the data table to reflect the timing effect of elements with hysteresis. To overcome these problems, this article proposes using a set of polynomials to emulate the CPM instead of table look-up. The polynomial approximation requires less memory space for data storage. Furthermore, the polynomial coefficients can be obtained accurately by using least-squares approximation or an adaptive identification algorithm, which also opens the possibility of accurately tracking the hysteresis model parameters.
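A hedged sketch of the basic ingredient: replacing one sampled look-up curve by a low-order polynomial fitted with least squares, so only the coefficients need to be stored. The curve below is synthetic; the paper fits a family of such polynomials to emulate the Preisach model.

```python
import numpy as np

u = np.linspace(-1.0, 1.0, 200)                       # input samples
table = np.tanh(3 * u) + 0.05 * np.random.default_rng(2).normal(size=u.size)  # noisy sampled curve

coeffs = np.polynomial.polynomial.polyfit(u, table, deg=7)   # least-squares fit
approx = np.polynomial.polynomial.polyval(u, coeffs)

print("stored values: table =", table.size, "-> coefficients =", coeffs.size)
print("max abs error:", float(np.max(np.abs(approx - table))))
```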

8.
In online automotive applications it is common to use look-up tables, or maps, to describe nonlinearities in component models that are to be valid over large operating ranges. If the component characteristics change with aging or wear, these look-up tables must be updated online. For 2-D look-up tables, the existing methods in the literature only adapt the observable parameters in the look-up table, which means that parameters in operation points that have not been visited for a long time may be far from their true values. In this work, correlations between different operating points are used to also update non-observable parameters of the look-up table. The method is applied to Open Circuit Voltage (OCV) curves for aged battery cells. From laboratory experimental data it is demonstrated that the proposed method can significantly reduce the average deviation from an aged OCV-curve compared to keeping the OCV-curve from the beginning of the cell’s life, both for observable and non-observable parameters.
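A hedged sketch of the core idea, not the paper's estimator: when a measurement corrects the table value at one operating point, also nudge correlated (here simply: nearby) grid points, so rarely visited points still get updated. The 1-D OCV(SOC) grid, the Gaussian correlation weight, and the gain are illustrative assumptions.

```python
import numpy as np

soc_grid = np.linspace(0.0, 1.0, 21)        # state-of-charge break points
ocv_table = 3.0 + 1.2 * soc_grid            # initial (beginning-of-life) curve

def update_table(soc_meas, ocv_meas, gain=0.5, corr_width=0.15):
    """Spread the observed error to neighbouring (correlated) grid points."""
    pred = np.interp(soc_meas, soc_grid, ocv_table)
    err = ocv_meas - pred
    weights = np.exp(-0.5 * ((soc_grid - soc_meas) / corr_width) ** 2)
    ocv_table[:] = ocv_table + gain * err * weights

update_table(soc_meas=0.50, ocv_meas=3.55)   # one new measurement at SOC = 0.5
print(ocv_table[8:13].round(3))              # neighbouring entries moved too
```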

9.
The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the less data involved in query processing, the better the performance achieved. Disk-based join operation is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volumes. They incur high I/O cost when operating on massive data because of multi-pass I/O operations on the joined tables and insensitivity to join selectivity. This paper proposes PI-Join, a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: the JPIPT construction stage (JCS) and the result output stage (ROS). JCS runs a cache-conscious construction algorithm on the join attributes, which are kept in a column-oriented layout, to quickly obtain the join positional index pair table (JPIPT) of the join results. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.
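A hedged sketch of the join-positional-index idea in memory: first join only the narrow, column-oriented join-key columns to get position pairs, then fetch the wide rows of both tables once, in position order. The real PI-Join builds its JPIPT with a cache-conscious algorithm and performs selective sequential scans on disk; this only mirrors the shape of the plan.

```python
from collections import defaultdict

def positional_join(keys_r, keys_s):
    """Return (pos_in_R, pos_in_S) pairs for equal join keys."""
    index = defaultdict(list)
    for j, k in enumerate(keys_s):
        index[k].append(j)
    return [(i, j) for i, k in enumerate(keys_r) for j in index.get(k, ())]

R = [("r%d" % i, i % 5, "payload_r") for i in range(10)]   # wide rows of table R
S = [("s%d" % j, j % 3, "payload_s") for j in range(6)]    # wide rows of table S

pairs = positional_join([r[1] for r in R], [s[1] for s in S])  # join columns only
pairs.sort()                                   # visit each table in position order
result = [(R[i], S[j]) for i, j in pairs]      # one selective pass per table
print(len(result), result[0])
```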

10.
In this paper, we present a novel multiple-phase collective I/O technique for generic block-cyclic distributions. The I/O technique is divided into two stages: inspector and executor. During the inspector stage, the communication pattern is computed and the required datatypes are automatically generated. This information is used during the executor stage to perform the communication and file accesses. The two stages are decoupled, so that for repetitive file access patterns the computations from the inspector stage can be performed once and reused several times by the executor. This strategy allows the inspector cost to be amortized over several I/O operations. In this paper, we evaluate the performance of the multiple-phase collective I/O technique and compare it with other state-of-the-art approaches. Experimental results show that for small access granularities, our method outperforms other parallel I/O optimization techniques in the large majority of cases.
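A hedged sketch of the inspector/executor split only: the access-pattern analysis runs once and its result is reused by every repetition of the same I/O pattern. The cyclic block layout and the plain byte-level reads are illustrative assumptions; the paper's inspector additionally builds MPI datatypes, which is not shown.

```python
import io

def inspector(rank, nprocs, total_blocks):
    """Compute once which blocks this process owns under a cyclic layout."""
    return [b for b in range(total_blocks) if b % nprocs == rank]

def executor(fileobj, plan, block_size):
    """Reuse the precomputed plan for the actual (repetitive) file accesses."""
    chunks = []
    for b in plan:
        fileobj.seek(b * block_size)
        chunks.append(fileobj.read(block_size))
    return b"".join(chunks)

plan = inspector(rank=1, nprocs=2, total_blocks=8)   # computed once
f = io.BytesIO(bytes(range(8 * 4)))                  # fake file of 8 blocks of 4 bytes
for _ in range(3):                                   # plan reused across repetitions
    data = executor(f, plan, block_size=4)
print(plan, list(data[:4]))
```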

11.

Rough set theory (RS), introduced by Zdzislaw Pawlak in the early 1980s, is a methodology for the classificatory analysis of imprecise, uncertain, or incomplete information or knowledge expressed in terms of data acquired from experience or observation. It can distinguish between objects and reason about the objects in the universe, where objects are perceived through the information available about them, namely the values of a predetermined set of attributes. The main advantage of RS is that it requires no information beyond the data represented in the table. A supervised neural network, on the other hand, learns by abstracting a mapping function from the training data for classification purposes. However, the drawback of using a supervised neural network is that a large amount of training data must be provided for it to obtain an accurate mapping function. The problem is further aggravated if the data are continuous (real-valued). Thus, in this paper we overcome the problem by transforming the continuous training data into discrete values using rough set theory and the Boolean reasoning technique. Here, global shape features are chosen to represent the logo images. The invariant features representing the logo images are obtained using the geometric invariant moment technique (Hu, 1962). The classification results show that discretization using rough sets and Boolean reasoning can reduce the training cycles and significantly increase the accuracy of logo image classification.
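A hedged sketch of the feature-extraction step only: Hu's seven invariant moments for a binary logo image (translation-, scale-, and rotation-invariant global shape features). The toy image and the log scaling of the moments are illustrative; the rough-set discretization and the neural classifier are not shown.

```python
import cv2
import numpy as np

logo = np.zeros((64, 64), dtype=np.uint8)
cv2.circle(logo, (32, 32), 20, 255, -1)        # toy "logo": a filled disc

hu = cv2.HuMoments(cv2.moments(logo)).flatten()          # seven Hu moments
features = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # usual log scaling
print(features.round(3))
```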

12.
To address the storage space consumed by empty fields of high-dimensional sparse relational tables in large databases, a fragment-based database structure is designed that uses a conventional row-store database to emulate the working principle of a column-store database. Experimental analysis shows that the storage space is reduced by about 27.42% compared with the original schema. When querying five fields of the high-dimensional sparse data, the number of I/O data blocks drops to 35.27% of the original schema; when querying four fields, it drops to 28.22%; and the number of I/O blocks decreases further as fewer fields are queried, thereby improving database access efficiency.
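A hedged sketch of vertical fragmentation in a row store: each sparse column becomes its own narrow (row_id, value) fragment holding only the non-empty entries, which is how a row-oriented engine can emulate column-oriented storage for high-dimensional sparse tables. Field names and the in-memory representation are illustrative only.

```python
wide_rows = {                    # row_id -> {field: value}; most fields are empty
    1: {"f3": "a"},
    2: {"f3": "b", "f7": 42},
    3: {},
}

fragments = {}                   # field -> list of (row_id, value)
for rid, fields in wide_rows.items():
    for f, v in fields.items():
        fragments.setdefault(f, []).append((rid, v))

print(fragments)                 # only non-empty cells are stored
# Reconstructing the projection (f3, f7) for row 2 touches two small fragments:
print(dict(fragments["f3"]).get(2), dict(fragments["f7"]).get(2))
```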

13.
We present a technique implementing space-variant filtering of an image, with kernels belonging to a given family, in time independent of the size and shape of the filter kernel support. The essence of our method is efficient approximation of these kernels, belonging to an infinite family governed by a small number of parameters, as a linear combination of a small number k of “basis” kernels. The accuracy of this approximation increases with k, and requires O(k) storage space. Any kernel in the family may be applied to the image in O(k) time using precomputed results of the application of the basis kernels. Performing linear combinations of these values with appropriate coefficients yields the desired result. A trade-off between algorithm efficiency and approximation quality is obtained by adjusting k. The basis kernels are computed using singular value decomposition, distinguishing this from previous techniques designed to achieve a similar effect. We illustrate by applying our methods to the family of elliptic Gaussian kernels, a popular choice for filtering warped images.
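A hedged sketch of the basis-kernel construction: stack many kernels from the family as rows, take the SVD, and keep the top-k right singular vectors as "basis" kernels, so any kernel in the family is approximated by a k-term linear combination. Isotropic (rather than elliptic) Gaussians and the specific kernel size are simplifying assumptions.

```python
import numpy as np

size = 15
xx, yy = np.meshgrid(np.arange(size) - size // 2, np.arange(size) - size // 2)

def gaussian(sigma):
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (g / g.sum()).ravel()                    # flatten kernel to a vector

family = np.array([gaussian(s) for s in np.linspace(0.8, 4.0, 60)])
U, S, Vt = np.linalg.svd(family, full_matrices=False)

k = 4
basis = Vt[:k]                                      # k basis kernels
target = gaussian(2.37)                             # a kernel outside the sampled set
coeffs = basis @ target                             # projection coefficients
approx = coeffs @ basis                             # k-term linear combination
print("relative error with k=4:",
      float(np.linalg.norm(approx - target) / np.linalg.norm(target)))
```

Filtering then amounts to convolving the image with the k basis kernels once and, at every pixel, combining the precomputed responses with the coefficients of the locally required kernel.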

14.
赵煌 《微机发展》1996,6(2):20-22
This paper studies the structure and general model of the tables handled by the table-processing system in question, gives the design principles and the statement forms for tables, and offers programmers a new method for handling I/O data.

15.
A Fortran subroutine calculates the least squares approximation to n data values containing random errors subject to non-negative second divided differences (convexity). The method employs a dual active set quadratic programming technique that allows several concavities of an iterate to be corrected simultaneously, which is a distinctive feature of this calculation. A B-spline representation of the iterates reduces each active set calculation to an unconstrained minimization with fewer variables that requires only O(n) computer operations. Details of these techniques, including the data structures underlying the implementation of the method, are given. Numerical testing on a variety of data sets indicates that the subroutine is particularly efficient, terminating after a small number of active-set changes, which makes it suitable for large numbers of data. A numerical example and its output are provided to illustrate the use of the software.
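A hedged sketch of the convex least-squares problem the subroutine solves, done here with a generic bounded least-squares solver rather than the paper's O(n) dual active-set / B-spline method: the fit is parametrized by its first value, its first difference, and non-negative second differences on an equally spaced grid.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0, 40)
y = (x - 1.0) ** 2 + 0.05 * rng.normal(size=x.size)   # convex data plus noise

n = x.size
A = np.zeros((n, n))
A[:, 0] = 1.0                                   # coefficient of y_0
A[:, 1] = np.arange(n)                          # coefficient of the first difference
for j in range(2, n):                           # coefficients of second differences d_j
    A[j:, j] = np.arange(1, n - j + 1)

lb = np.full(n, -np.inf)
lb[2:] = 0.0                                    # convexity: d_j >= 0
res = lsq_linear(A, y, bounds=(lb, np.full(n, np.inf)))
fit = A @ res.x
print("non-negative second differences:",
      bool(np.all(np.diff(fit, 2) > -1e-8)))
```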

16.
An approximation technique to solve central server queueing models with a priority dispatching rule is described in this paper. This technique reduces the problem size by aggregating I/O activities and combining different classes into a smaller number of classes, so that the reduced problem can be solved efficiently. For the numerical examples considered, throughputs and mean CPU queue sizes calculated by the approximation differ from simulation results by less than 15 percent.

17.
Integrating data mining into a relational database allows data mining techniques to build mining models directly from the data stored in the relational database. Taking decision tree techniques as a case study, a new SQL-based decision tree algorithm is proposed by redesigning the decision tree algorithm as SQL-based aggregation and query operations. Theoretical analysis shows that, in terms of I/O performance, the proposed algorithm outperforms the SPRINT algorithm.
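A hedged sketch of the key idea: the class counts a decision-tree split needs can be obtained with a single SQL GROUP BY aggregation, letting the database engine do the scan. The schema and the use of conditional entropy below are illustrative, not the paper's algorithm.

```python
import sqlite3
from math import log2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE train (outlook TEXT, label TEXT)")
con.executemany("INSERT INTO train VALUES (?, ?)", [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "no"), ("overcast", "yes"),
])

# One aggregation query yields everything needed to score the attribute.
counts = con.execute(
    "SELECT outlook, label, COUNT(*) FROM train GROUP BY outlook, label"
).fetchall()

def entropy(dist):
    total = sum(dist)
    return -sum(c / total * log2(c / total) for c in dist if c)

by_value = {}
for value, label, c in counts:
    by_value.setdefault(value, []).append(c)
n = sum(sum(v) for v in by_value.values())
split_entropy = sum(sum(v) / n * entropy(v) for v in by_value.values())
print(f"conditional entropy of a split on outlook: {split_entropy:.3f}")
```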

18.
Correspondence analysis is an exploratory technique for analyzing the interaction in a contingency table. Tables with meaningful orders of the rows and columns may be analyzed using a model-based correspondence analysis that incorporates order constraints. However, if there exists a permutation of the rows and columns of the contingency table so that the rows are regression dependent on the columns and, vice versa, the columns are regression dependent on the rows, then both implied orders are reflected in the first dimension of the unconstrained correspondence analysis [Schriever, B.F., 1983. Scaling of order dependent categorical variables with correspondence analysis. International Statistical Review 51, 225-238]. Thus, using unconstrained correspondence analysis, we may still find that the data fit an ordinal stochastic model. Fit measures are formulated that may be used to verify whether the re-ordered contingency table is regression dependent in either the rows or columns. Using several data examples, it is shown that the fit indices may complement the usual geometric interpretation of the unconstrained correspondence analysis solution in low-dimensional space.
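A hedged sketch of plain (unconstrained) correspondence analysis via an SVD of the standardized residuals; for a table whose rows and columns can be permuted into a regression-dependent arrangement, the first-dimension scores recover that order. The toy table is invented, and the paper's fit indices for checking regression dependence are not reproduced.

```python
import numpy as np

N = np.array([[20,  5,  1],
              [10, 15,  5],
              [ 2, 10, 20]], dtype=float)          # toy ordered contingency table

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                # row / column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c)) # standardized residuals
U, sv, Vt = np.linalg.svd(S)

row_scores = U[:, 0] / np.sqrt(r)                  # first-dimension row scores
col_scores = Vt[0, :] / np.sqrt(c)                 # first-dimension column scores
# Ordering implied by dimension 1 (sign of the singular vectors is arbitrary,
# so the order may come out reversed):
print("row order :", np.argsort(row_scores))
print("col order :", np.argsort(col_scores))
```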

19.
Due to the large difference between seek time and transfer time in current disk technology, it is advantageous to perform large I/O using a single sequential access rather than multiple small random I/O accesses. However, prior optimal cost and data placement approaches for processing range queries over two-dimensional datasets do not consider this property. In particular, these techniques do not consider the issue of sequential data placement when multiple I/O blocks need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries by declustering two-dimensional datasets over multiple devices, and prove that, in general, it is impossible to achieve the new optimal cost. This is because disks cannot facilitate the two-dimensional sequential access required by the new optimal cost. We then revisit the existing data allocation schemes under the new optimal cost, and show that none of them can achieve it. Fortunately, MEMS-based storage is being developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage as conventional disks. We then propose a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem. Our theoretical analysis and experimental results show that the new scheme achieves almost optimal I/O costs.

20.
Persistent memory (PM) is non-volatile, byte-addressable, low-latency, and high-capacity; it breaks the traditional boundary between memory and external storage and has a disruptive impact on existing software architectures. However, current PM hardware still suffers from problems such as uneven wear and read/write asymmetry; in particular, when PM is accessed across NUMA (non-uniform memory access) nodes, I/O performance degrades severely. This paper proposes a NUMA-aware optimized design of a PM storage engine and applies it to GoldenX, ZTE's new-generation database system, significantly reducing the cost of cross-NUMA-node access to persistent memory. The main contributions include: a cross-NUMA-node data space distribution strategy and a distributed access model for a hybrid DRAM+PM memory architecture, enabling efficient use of the PM data space; to address the high cost of cross-NUMA PM access, an I/O proxy routine access method that converts a cross-NUMA PM access into one remote DRAM memory copy plus a local PM access, together with a Cache Line Area (CLA) cache page mechanism that mitigates the I/O write amplification problem and improves the efficiency of local PM access; and an extension of the traditional tablespace concept so that each tablespace has both its own table data storage and its own dedicated WAL (write-ahead logging) storage; for this distributed WA...
