Similar literature
1.
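Set types are an important data type in object-oriented and object-relational databases, but effective index structures that support the related queries are still lacking. An index structure for set-valued data, Settrie, is proposed. It is built on the common prefixes of the data in the database. Unlike the inverted file (Invert file), Settrie organizes repeated data sensibly, so a query accesses less data than with an Invert file, which improves the performance of selection operations. Experiments show that, compared with the Invert file, the method improves the performance of the various intersection-based selection operations on set data. Several optimizations of Settrie are also discussed.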
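The prefix-sharing idea behind a set trie can be sketched as follows. This is a minimal illustration, not the Settrie structure from the paper: the class names `SetTrieNode` and `SetTrie` and the method `supersets_of` are hypothetical, and the superset query shown is only one of the selection operations such an index might support.

```python
class SetTrieNode:
    def __init__(self):
        self.children = {}   # element -> child node
        self.is_end = False  # True if a stored set ends at this node


class SetTrie:
    """Trie over sorted sets; sets with a common prefix share nodes."""

    def __init__(self):
        self.root = SetTrieNode()

    def insert(self, items):
        # Store the set as a path of its elements in ascending order.
        node = self.root
        for x in sorted(set(items)):
            node = node.children.setdefault(x, SetTrieNode())
        node.is_end = True

    def supersets_of(self, query):
        """Return every stored set that contains all elements of query."""
        q = sorted(set(query))
        out = []
        self._search(self.root, q, 0, [], out)
        return out

    def _search(self, node, q, i, path, out):
        # i counts how many query elements have been matched on this path.
        if i == len(q) and node.is_end:
            out.append(set(path))
        for x, child in node.children.items():
            nxt = i
            if i < len(q):
                if x > q[i]:
                    continue  # paths are ascending, so q[i] cannot occur below
                if x == q[i]:
                    nxt = i + 1
            self._search(child, q, nxt, path + [x], out)
```

Because every path is sorted, a branch whose element already exceeds the next unmatched query element can be skipped without visiting the sets stored beneath it, which is where the reduction in accessed data comes from in this sketch.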

2.
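Jia Jianwei, Chen Ling. Computer Science (计算机科学), 2016, 43(6): 254-256, 311
When b-bit hash functions are used to approximate the Jaccard similarity of two sets, they cannot distinguish well among elements whose Jaccard similarity to the input element is very high (close to 1). To improve the accuracy of the data sketch function and the performance of similarity-based applications, an approximate set-similarity algorithm based on the parity of data sketches is proposed. After minwise hash functions are applied to obtain two transformed sets, two n-bit indicator vectors record the parity with which elements of the transformed sets occur, and the Jaccard similarity of the original sets is estimated from these two parity vectors. The parity sketch is derived under two models, a Markov chain and a Poisson distribution, and the two derivations are proved equivalent. Experiments on the Enron data set show that the proposed parity sketch algorithm is more accurate than the traditional b-bit hash function and performs better in two applications: duplicate document detection and association rule mining.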
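For context, the baseline that the abstract compares against can be sketched as below. This shows plain minwise hashing and a naive b-bit comparison only; it is not the parity sketch the paper proposes. The helper names (`make_hashes`, `signature`, `estimate_jaccard`), the linear hash family, and the restriction to non-negative integer elements smaller than `PRIME` are all assumptions made for illustration.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)


def make_hashes(k, seed=0):
    """Draw k linear hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def signature(items, hashes):
    """Minwise signature: the smallest hash value of the set under each function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b, b_bits=None):
    """Fraction of signature positions that agree.

    With b_bits set, only the lowest b bits of each value are compared.
    That raw fraction is biased upward by accidental collisions, so it is
    shown here without the usual bias correction.
    """
    if b_bits is not None:
        mask = (1 << b_bits) - 1
        sig_a = [v & mask for v in sig_a]
        sig_b = [v & mask for v in sig_b]
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)
```

Truncating to b bits saves space but can only add matches, never remove them, which is one way to see why very similar sets become hard to tell apart with small b.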

3.
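A new temporal database indexing technique, TQD-tree, is discussed. Unlike the conventional "algebraic" approach, TQD-tree is based on an "order relation" mathematical framework, yet like the algebraic approach it supports set-at-a-time data operations. The concept of a linear-order partition over a given set of time periods is introduced and a construction algorithm is established. Within the linear-order partition framework, the temporal data index TQD-tree is proposed and the corresponding query and update algorithms are studied. Simulation experiments were designed to give a basic evaluation of the performance of TQD-tree.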

4.
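In Map-based publish/subscribe systems, typical event-matching algorithms mostly search for matching subscriptions for each event a user publishes. Because the same attribute commonly recurs across different events, when the number of published events is large, identical attributes are matched repeatedly against the constraints in subscriptions, so event matching contains redundancy. To address this repeated matching, an event-matching algorithm based on duplicate-attribute detection is proposed. The algorithm determines the duplication relationships among attributes, merges the event set to remove duplicate attributes, and organizes the subscription set into a multi-level index of constraints to reduce unnecessary matching, thereby improving matching efficiency and maintainability. Experiments show that when the numbers of events and subscriptions are large, the algorithm matches more efficiently than comparable algorithms.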

5.
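To address the running-time bottleneck of the classic Apriori algorithm, and exploiting the small memory footprint and fast logical operations of bit sets, an improved bit-set-based algorithm, ABS, is proposed. The algorithm builds bit sets for the transaction set in a single database scan; determines frequent itemsets using bitwise AND on the bit sets together with bit counting; and improves the join and pruning strategy by applying bitwise OR to the bit sets and counting how often each result recurs in order to generate candidate itemsets. In mining the frequent itemsets of the sample database Northwind, the improved algorithm ran in markedly less time than Apriori. The algorithm avoids repeated database scans and cumbersome join and pruning operations, further improving the efficiency of Apriori.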
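The single-scan bit-set construction and the AND-plus-bit-count support test described above can be sketched as follows. This is an illustrative sketch, not the ABS algorithm itself: the function names are hypothetical, Python integers stand in for bit sets, and candidate generation uses a plain set-union join rather than the paper's OR-based strategy.

```python
def build_item_bitsets(transactions):
    """Single pass: bit t of bitsets[item] is set iff transaction t contains item."""
    bitsets = {}
    for t, items in enumerate(transactions):
        for item in items:
            bitsets[item] = bitsets.get(item, 0) | (1 << t)
    return bitsets


def support(itemset, bitsets):
    """AND the bit sets of all items, then count the set bits."""
    items = list(itemset)
    acc = bitsets.get(items[0], 0)
    for item in items[1:]:
        acc &= bitsets.get(item, 0)
    return bin(acc).count("1")


def frequent_itemsets(transactions, min_support):
    """Level-wise search; the transactions are never rescanned after the first pass."""
    bitsets = build_item_bitsets(transactions)
    current = [frozenset([i]) for i in sorted(bitsets)
               if support([i], bitsets) >= min_support]
    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s, bitsets)
        # Join step (plain set union, an assumption): size-(k+1) candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = [c for c in candidates if support(c, bitsets) >= min_support]
        k += 1
    return result
```

Once the per-item bit sets exist, every support count is a handful of integer operations, which is the property the abstract credits for the reduced running time.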

6.
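Compared with traditional polygon-set union algorithms, the cascaded union method uses an STR-Tree index to union adjacent polygons first, which improves the efficiency of unioning a polygon set, but it performs poorly in regions where data density is highly skewed. To address this problem, a grid-based cascaded union algorithm for polygon sets is proposed. The algorithm partitions the polygon set with a grid, shrinking the extent of the highly skewed regions and further improving the efficiency of the cascaded union method. Experimental results show that the algorithm is effective and feasible.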

7.
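URL deduplication algorithm based on the Rabin fingerprint method   Cited by: 2 (self-citations: 1, citations by others: 1)
Existing URL lookup algorithms occupy considerable storage and are slow on URL sets with a high duplication rate, which lowers the efficiency of a Web Spider. To address this, an improved URL deduplication algorithm is proposed. Based on the Rabin fingerprint method, the algorithm uses the fingerprint of a URL as an address and marks each URL with a single bit, so each lookup requires only one test of the corresponding bit. Experiments show that the algorithm effectively removes duplicate URLs from a URL set and improves lookup speed.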
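The fingerprint-as-address idea can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the 20-bit fingerprint width, the modulus polynomial x^20 + x^3 + 1, and the names `fingerprint` and `UrlSeenFilter` are chosen purely for illustration. Two different URLs can share a fingerprint, in which case the second would be reported as already seen.

```python
DEGREE = 20                        # fingerprint width in bits (assumption)
POLY = (1 << 20) | (1 << 3) | 1    # x^20 + x^3 + 1, illustrative modulus


def fingerprint(data: bytes) -> int:
    """Remainder of the message, read as a polynomial over GF(2), modulo POLY."""
    fp = 0
    for byte in data:
        for shift in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> shift) & 1)
            if fp >> DEGREE:       # degree reached 20: subtract (XOR) the modulus
                fp ^= POLY
    return fp


class UrlSeenFilter:
    """Bit array addressed by fingerprint; one bit per possible fingerprint."""

    def __init__(self):
        self.bits = bytearray((1 << DEGREE) // 8)

    def seen_before(self, url: str) -> bool:
        """Return True if the URL's bit was already set; otherwise set it."""
        byte_index, bit = divmod(fingerprint(url.encode("utf-8")), 8)
        mask = 1 << bit
        if self.bits[byte_index] & mask:
            return True
        self.bits[byte_index] |= mask
        return False
```

With these parameters the whole filter occupies 128 KiB regardless of how many URLs are tested, and each lookup touches a single bit, at the cost of the collision risk noted above.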

8.
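To address the large number of redundant rules and the low efficiency encountered when mining association rules, a constrained association rule mining algorithm based on transaction ID sets, ACARMT, is proposed. The algorithm combines the strengths of the Separate algorithm and of algorithms based on the vertical data layout: it first generates the base frequent itemsets from the constraints and then stores itemset information in transaction ID sets, which avoids repeated database scans and improves mining efficiency. Experiments applying the algorithm to real reproductive-health data show that it remains effective when the data volume exceeds what vertical-layout algorithms can handle, and that it is more efficient than the Separate algorithm.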

9.
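An artificial immune algorithm based on immune principles is proposed for mining fuzzy association rules. The algorithm performs optimization by borrowing the clonal selection principle of the biological immune system. Working directly from the given data, it uses the optimization mechanism to determine automatically the fuzzy sets corresponding to each attribute, so that the number of derived fuzzy association rules satisfying the conditions is maximized. Performance comparisons with related algorithms on real data sets demonstrate the effectiveness of the proposed algorithm.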

10.
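Traditional image keypoint detection algorithms are mostly hand-designed, cannot adapt to scene changes, and generalize poorly. To address this, an image keypoint detection algorithm based on a feature pyramid network is proposed. By fusing multi-scale features within the network, the detected keypoints become scale-invariant, and repeatable, robust keypoints can be extracted. To improve the performance of the algorithm, an effective method for generating training data sets is also proposed; the training data cover a variety of complex indoor and outdoor scenes. The algorithm is tested on several public data sets and compared with other keypoint detection algorithms. The experimental results show that the keypoints it extracts perform well in terms of repeatability.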

11.
The use of a composite index known as the Bc-tree is presented; it is based on the concept of the B+-tree, serves the dual purpose of an attribute index and a join index, and indirectly implements the link sets. Each leaf node of the Bc-tree incorporates a reference to all tuples in the database that share common data values of a shared domain. In addition to improving the performance of the join and selection operations, the composite index facilitates the enforcement of structural integrity constraints. The author also presents the results of simulations that compare the performance of this approach with the simple join technique. For the simulated database, the proposed approach is seen to provide better performance once the average domain value size exceeds a threshold of between 2 and 4 bytes.

12.
The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of the similarity join are well-known: the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important, third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, postprocessing of sampling-based data mining, etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
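The semantics of the k-nearest neighbour join can be made concrete with a brute-force sketch. This only defines what the operation returns; it does not use the multipage index (MuX) or any of the loading strategies from the paper. The function name `knn_join` and the choice of Euclidean distance over coordinate tuples are assumptions.

```python
import heapq
import math


def knn_join(r_points, s_points, k):
    """For each point r in R, return its k nearest points in S (Euclidean)."""
    result = {}
    for r in r_points:
        # nsmallest scans all of S for every r: O(|R| * |S|) distance computations.
        result[r] = heapq.nsmallest(k, s_points, key=lambda s: math.dist(r, s))
    return result
```

Unlike the distance range join, the result size here is fixed at k pairs per point of R, and unlike the k-distance join the ranking is per point rather than global. An index-based algorithm aims to produce the same output without the full scan of S for every r.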

13.
Database design is based on the concept of data dependency, which is the interrelationship between data contained in various sets of attributes. In particular, functional, multivalued and acyclic join dependencies play an essential role in the design of database schemas. The basic definition of an information metric and how this notion can be used in relational databases are discussed in this paper. We use Shannon entropy as an information metric to quantify the information associated with a set of attributes. Thus, we prove that data dependencies can be formulated in terms of entropies. These formulas make the numerical computation and testing of data dependencies feasible. Among the different types of data dependencies, the acyclic join dependency is most important to the design of a relational database schema. The acyclic join dependency, with multivalued dependency as a special case, imposes a constraint on the information-preserving decomposition of a relation. It is interesting that this constraint on a relation is similar to Gibbs' condition for separating physical systems in statistical mechanics. They both assert that entropy is preserved during the decomposition process. That is, the entropies of the corresponding set of attributes must satisfy the inclusion–exclusion identity.
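As a small illustration of testing a dependency numerically, the sketch below checks a functional dependency X -> Y through the condition H(X ∪ Y) = H(X), where H is the Shannon entropy of the projection with every tuple of the relation taken as equally likely. The function names, the representation of rows as dictionaries, and the assumption that the relation holds no duplicate tuples are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def entropy(rows, attrs):
    """Shannon entropy in bits of the projection onto attrs (uniform over tuples)."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def holds_fd(rows, lhs, rhs, tol=1e-9):
    """X -> Y holds iff H(X u Y) == H(X), i.e. Y adds no information given X."""
    both = list(dict.fromkeys(list(lhs) + list(rhs)))  # ordered union of attributes
    return abs(entropy(rows, both) - entropy(rows, lhs)) < tol
```

The same `entropy` helper could be reused to test the analogous identity for a multivalued dependency X ->> Y | Z, namely H(XY) + H(XZ) = H(X) + H(XYZ), which is the two-component case of the inclusion-exclusion identity mentioned in the abstract.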

14.
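The bitmap join index is an effective indexing mechanism in data warehouses for optimizing the performance of joins between tables. In large-memory analytical processing scenarios, a bitmap join index must not only balance the memory and CPU overhead of the index but also take into account the performance gains and data access latency that the processor platform brings. A service-based bitmap join index management mechanism is proposed, with three main features: a self-managing index mechanism independent of the database; a TOP K keyword bitmap join index mechanism under storage space constraints; and processor-conscious bitmap join index techniques. The index service turns the index from a data structure built into the database into an index service layer outside the database. A user query workload analysis module and an index service management module replace the traditional model in which database administrators manage indexes by hand, while coprocessors and memory cloud technology improve the performance and flexibility of the index service. Experimental results show that the index service mechanism effectively improves index storage and access efficiency and that, supported by the powerful parallel processing capability of general-purpose GPUs, both the performance of the bitmap join index service and overall database query processing performance improve significantly.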

15.
Data warehouses are very large databases usually designed using the star schema. Queries defined on data warehouses are generally complex because of the join operations involved. The performance of star schema queries in data warehouses is highly critical and its optimization is hard in general. Several query performance optimization methods exist, such as indexes and table partitioning. In this paper, we propose a new approach based on binary particle swarm optimization for solving the bitmap join index selection problem in data warehouses. This approach selects the optimal set of bitmap join indexes based on a mathematical cost model. Several experiments are performed to demonstrate the effectiveness of the proposed method on the bitmap join index selection problem. Further testing of the method is performed using a cost function specific to the database environment. The binary particle swarm optimization is found to be more effective than both the genetic algorithm and data mining based approaches.

16.
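Li Xingye, Wang Shuning, Yue Zhanfeng. Journal of Software (软件学报), 2002, 13(10): 1915-1920
Using abstract algebra as a tool, the relationship between full sample dependencies (全样本依赖) and full join dependencies is explored. First, equivalence relations are established on the set of full sample dependencies and on the set of full join dependencies; both relations treat dependencies with the same effect as equivalent. It is then proved that the quotient sets under these two equivalence relations each form a monoid, and that the two monoids are isomorphic. This amounts to proving that the class of full sample dependencies is essentially identical to the class of full join dependencies. Finally, an interesting result about full acyclic join dependencies is given. These results can play a positive role in relational database design.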

17.
Efficient cost models for spatial queries using R-trees   Cited by: 11 (self-citations: 0, citations by others: 11)
Selection and join queries are fundamental operations in database management systems (DBMS). Support for nontraditional data, including spatial objects, in an efficient manner is of ongoing interest in database research. Toward this goal, access methods and cost models for spatial queries are necessary tools for spatial query processing and optimization. We present analytical models that estimate the cost (in terms of node and disk accesses) of selection and join queries using R-tree-based structures. The proposed formulae need no knowledge of the underlying R-tree structure(s) and are applicable to uniform-like and nonuniform data distributions. In addition, experimental results are presented which show the accuracy of the analytical estimations when compared to actual runs on both synthetic and real data sets.

18.
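The XML document approximate join finds approximately matching XML documents across two XML document collections, and it is widely used in systems such as XML-based information integration and XML data cleaning. A notable problem with current XML approximate join operations, however, is that when documents differ substantially there is a great deal of repeated computation, which lowers processing efficiency. To address this, a clustering-based approximate join method for XML documents is proposed. The basic idea is to build an index for each XML document; if the indexes of several documents in the two data sets are sufficiently similar, the documents are grouped into a cluster, and the approximate join is then performed within each cluster. Documents that fall in no cluster need no computation at all. Experimental results show that the proposed method is efficient while preserving accuracy.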

19.
Object database management systems (ODBMSs) are now established as the database management technology of choice for a range of challenging data intensive applications. Furthermore, the applications associated with object databases typically have stringent performance requirements, and some are associated with very large data sets. An important feature for the performance of object databases is the speed at which relationships can be explored. In queries, this depends on the effectiveness of different join algorithms into which queries that follow relationships can be compiled. This paper presents a performance evaluation of the Polar parallel object database system, focusing in particular on the performance of parallel join algorithms. Polar is a parallel, shared‐nothing implementation of the Object Database Management Group (ODMG) standard for object databases. The paper presents an empirical evaluation of queries expressed in the ODMG Query Language (OQL), as well as a cost model for the parallel algebra that is used to evaluate OQL queries. The cost model is validated against the empirical results for a collection of queries using four different join algorithms, one that is value based and three that are pointer based. Copyright © 2005 John Wiley & Sons, Ltd.

20.
In a distributed spatial database system, a user may issue a query that relates two spatial relations not stored at the same site. Because of the sheer volume and complexity of spatial data, spatial joins between two spatial relations at different sites are expensive in terms of computational and transmission costs. In this paper, we address the problems of processing spatial joins in a distributed environment. We propose a semijoin-like operator, called the spatial semijoin, to prune away objects that do not contribute to the join result. This operator also reduces both the transmission and local processing costs for a later join operation. However, the cost of the elimination process must be taken into account, and we consider approaches to minimize these overheads. We also study and compare two families of distributed join algorithms that are based on the spatial semijoin operator. The first is based on multi-dimensional approximations obtained from an index such as the R-tree, and the second is based on single-dimensional approximations obtained from object mapping. We have conducted experiments on real data sets and report the results in this paper.
