Similar Documents
20 similar documents found
1.
With the increasing popularity of XML applications in enterprise and big data systems, efficient query optimizers are becoming essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best possible query execution plan. In this work, we propose a novel selectivity estimator, called XHQE, which is a hybrid of a structural synopsis and structural statistics. The structural synopsis enhances the accuracy of estimation, and the structural statistics make it scalable to the allocated memory space. The structural synopsis is generated by labeling the nodes of the source XML dataset with a fingerprint function and merging subtrees with similar fingerprints (i.e., similar structures). The generated structural synopsis and statistics are then used to estimate the selectivity of given queries. We studied the performance of the proposed approach using different types of queries and four benchmark datasets with different structural characteristics, comparing XHQE with existing algorithms such as Sampling, TreeSketch, and a histogram-based algorithm. The experimental results show that XHQE is significantly better than the other algorithms in terms of estimation accuracy and scalability for semi-uniform datasets. For non-uniform datasets, the proposed algorithm has estimation accuracy comparable to TreeSketch even when the allocated memory size is greatly reduced, yet its estimation-data generation time is much lower (e.g., TreeSketch took more than 50 times longer on the XMark dataset). Compared to the histogram-based algorithm, our approach supports regular twig queries in addition to having higher accuracy under similar memory constraints.
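The abstract does not publish XHQE's actual fingerprint function, so the following sketch only illustrates the general idea of a fingerprint-merged structural synopsis: structurally identical subtrees hash to the same digest and are merged into one bucket with a count. The hashing scheme and node layout are assumptions, not the paper's implementation.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def fingerprint(elem):
    """Hash a node's tag together with the sorted fingerprints of its children,
    so structurally identical subtrees map to the same digest."""
    child_prints = sorted(fingerprint(c) for c in elem)
    payload = elem.tag + "|" + "|".join(child_prints)
    return hashlib.md5(payload.encode()).hexdigest()

def build_synopsis(root):
    """Group subtrees by fingerprint; each bucket keeps one exemplar tag and a count."""
    buckets = defaultdict(lambda: {"count": 0, "exemplar": None})
    stack = [root]
    while stack:
        node = stack.pop()
        fp = fingerprint(node)
        buckets[fp]["count"] += 1
        if buckets[fp]["exemplar"] is None:
            buckets[fp]["exemplar"] = node.tag
        stack.extend(node)                     # push children
    return buckets

doc = ET.fromstring(
    "<lib><book><title/><author/></book><book><title/><author/></book>"
    "<journal><title/></journal></lib>")
for fp, b in build_synopsis(doc).items():
    print(b["exemplar"], b["count"], fp[:8])   # the two <book> subtrees merge
```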

2.
张少敏, 蔡盼, 李翠平, 陈红. 《软件学报》, 2023, 34(5): 2413-2426
In an era of ever-growing data volume and complexity, big-data processing and analytics have become a major research focus. High-dimensional spatial data are used increasingly often, and data retrieval and access speed have become key measures of a data-processing system's performance. Designing an efficient high-dimensional index structure that raises query speed while lowering memory usage is therefore essential. Recently, Kraska et al. proposed the learned index, and experiments showed that it performs well on real datasets; since then, machine learning and deep learning have been applied ever more widely in database systems, and many researchers have tried to build learned indexes over high-dimensional data to accelerate high-dimensional queries. However, existing high-dimensional learned indexes do not exploit the information in the data distribution effectively, and their overly complex deep-learning models make index initialization expensive. This paper proposes a novel high-dimensional learned index that combines two techniques, spatial region partitioning and dimensionality reduction. It exploits data-distribution information more effectively to improve query efficiency, and uses multi-segment linear models to keep index initialization cheap while preserving lookup accuracy. Experiments on randomly generated datasets and an open-source street-map dataset show that, compared with existing high-dimensional indexes, it improves significantly on index construction, query efficiency, and memory usage.
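The paper's partitioning and dimensionality-reduction stages are not reproduced here; the sketch below only illustrates the multi-segment linear model idea on one-dimensional sorted keys, under the assumption of fixed-size segments. Each segment stores a fitted line plus its maximum prediction error, so a lookup predicts a position and then scans only a small, provably sufficient window.

```python
import bisect
import numpy as np

class PiecewiseLinearIndex:
    def __init__(self, keys, seg_size=64):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        n = len(self.keys)
        self.models = []                 # (start_key, slope, intercept, max_err)
        for lo in range(0, n, seg_size):
            hi = min(lo + seg_size, n)
            x = self.keys[lo:hi]
            y = np.arange(lo, hi, dtype=float)
            if hi - lo > 1:
                slope, intercept = np.polyfit(x, y, 1)
            else:
                slope, intercept = 0.0, float(lo)
            # Bound on how far the model's prediction can be from the true slot.
            err = int(np.max(np.abs(slope * x + intercept - y))) + 1
            self.models.append((x[0], slope, intercept, err))
        self.starts = [m[0] for m in self.models]

    def lookup(self, key):
        """Predict a position with the segment's line, then search a window
        of +/- max_err around it for the exact slot."""
        i = max(bisect.bisect_right(self.starts, key) - 1, 0)
        _, slope, intercept, err = self.models[i]
        pos = int(round(slope * key + intercept))
        lo, hi = max(pos - err, 0), min(pos + err + 1, len(self.keys))
        j = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return j if j < len(self.keys) and self.keys[j] == key else None

idx = PiecewiseLinearIndex(np.random.default_rng(0).uniform(0, 1e6, 10_000))
print(idx.lookup(idx.keys[1234]))   # -> 1234
```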

3.
K-Nearest Neighbor Query Algorithms on the Sphere
Given the characteristics of point datasets on a sphere and the requirements of K-nearest-neighbor queries, two algorithms for KNN queries on the sphere are proposed: a query-axis-based KNN algorithm (the PAM method) and a query-circle-based KNN algorithm (the PCM method). Experiments comparing the two show that both PAM and PCM are suitable for nearest-neighbor queries on the sphere, and that in terms of storage and query complexity the PAM method, relative to the PCM method, …
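The PAM and PCM structures themselves are not spelled out in the (truncated) abstract, so this is only a baseline sketch of the underlying primitive: K-nearest-neighbor search on a sphere using great-circle (haversine) distance instead of Euclidean distance. All names here are illustrative.

```python
import heapq
from math import radians, sin, cos, asin, sqrt

def great_circle(p, q, radius=6371.0):
    """Haversine distance (km) between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * asin(sqrt(h))

def knn_on_sphere(query, points, k):
    """Brute-force K nearest neighbors by great-circle distance."""
    return heapq.nsmallest(k, points, key=lambda p: great_circle(query, p))

pts = [(39.9, 116.4), (31.2, 121.5), (22.5, 114.1), (51.5, -0.1)]
print(knn_on_sphere((30.0, 120.0), pts, 2))
```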

4.
As RDF data continue to gain popularity, we witness a fast-growing trend in both the number of RDF repositories and the size of RDF datasets; many known datasets contain billions of RDF triples (subject, predicate, object). One of the grand challenges in managing these huge RDF data is how to execute RDF queries efficiently. In this paper, we address query processing against the billion-triple challenge. We first identify causes of the problems in existing query optimization schemes, such as large intermediate results and errors in initial query cost estimation. Then we present our block-oriented dynamic query plan generation approach powered by pipelined execution. Our approach consists of two phases. In the first phase, a near-optimal execution plan is chosen by identifying the processing blocks of the query: we group join patterns sharing a join variable into building blocks of the query plan, since executing them first provides opportunities to reduce the size of the intermediate results generated. In the second phase, we further optimize the initial pipelining for the given query plan, employing optimization techniques such as sideways information passing and semi-joins to further reduce intermediate results, improve query cost estimation, and speed up execution. Experimental results on several RDF datasets of over a billion triples demonstrate that our approach outperforms existing RDF query engines that rely on dynamic-programming-based static query processing strategies.
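A toy sketch of the first phase as described: triple patterns sharing a join variable are grouped into building blocks, and blocks are ordered by an estimated cardinality so the most selective runs first. The grouping rule (subject variable first, else object) and the cost model are placeholders, not the paper's estimator.

```python
from collections import defaultdict

def is_var(term):
    return term.startswith("?")

def join_key(pattern):
    """Pick the variable a pattern joins on (subject first, else object)."""
    s, _, o = pattern
    if is_var(s):
        return s
    if is_var(o):
        return o
    return "<ground>"

def build_blocks(patterns):
    """Group triple patterns sharing a join variable into building blocks."""
    blocks = defaultdict(list)
    for pat in patterns:
        blocks[join_key(pat)].append(pat)
    return blocks

def order_blocks(blocks, est_card):
    """Run the most selective block first to keep intermediate results small."""
    return sorted(blocks.items(), key=lambda kv: sum(map(est_card, kv[1])))

query = [("?x", "rdf:type", "ex:Person"),
         ("?x", "ex:worksAt", "?org"),
         ("?org", "ex:locatedIn", "ex:Berlin")]
# Placeholder cost model: more variables -> larger assumed cardinality.
for var, pats in order_blocks(build_blocks(query),
                              lambda p: 10 ** sum(map(is_var, p))):
    print(var, pats)
```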

5.
Linked Open Data initiatives have encouraged the publication of large RDF datasets into the Linking Open Data (LOD) cloud, including DBpedia, YAGO, and GeoNames. Despite the size of LOD datasets and the development of (semi-)automatic methods to create and link LOD data, these datasets may still be incomplete, negatively affecting the accuracy of Linked Data processing techniques. We improve query answer completeness by capturing knowledge collected from the crowd, and propose a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. Our system, HARE, implements these hybrid query processing techniques. HARE comprises several features: (1) a completeness model for RDF that exploits the characteristics of RDF to estimate the completeness of an RDF dataset; (2) a crowd knowledge base that captures crowd answers about missing values in the dataset; (3) a query engine that combines on-the-fly crowd knowledge with estimates from the completeness model to decide which sub-queries of a SPARQL query should be executed against the dataset and which via crowd computing, so as to enhance answer completeness; and (4) a microtask manager that exploits the semantics encoded in the dataset's RDF properties to crowdsource SPARQL sub-queries as microtasks and update the crowd knowledge base with the crowd's results. The effectiveness and efficiency of HARE are studied empirically on a collection of 50 SPARQL queries against the DBpedia dataset. Experimental results clearly show that our solution accurately enhances answer completeness.
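A toy sketch of the per-sub-query routing decision the abstract describes: if the estimated completeness of the relevant fragment is high enough, run the sub-query against the dataset; otherwise send it to the crowd and merge the answers. The threshold value and the stub functions are assumptions for illustration, not HARE's API.

```python
def route_subqueries(subqueries, completeness, run_on_dataset, ask_crowd,
                     threshold=0.8):
    """Route each sub-query to the dataset or the crowd by estimated completeness."""
    answers = []
    for sq in subqueries:
        if completeness(sq) >= threshold:
            answers.extend(run_on_dataset(sq))   # dataset deemed complete enough
        else:
            answers.extend(ask_crowd(sq))        # crowdsource the missing values
    return answers

# Hypothetical sub-queries and stub estimators/executors.
subqs = ["(?m dbo:director ?d)", "(?m dbo:releaseDate ?y)"]
print(route_subqueries(
    subqs,
    completeness=lambda sq: 0.95 if "director" in sq else 0.4,
    run_on_dataset=lambda sq: [f"dataset answer to {sq}"],
    ask_crowd=lambda sq: [f"crowd answer to {sq}"]))
```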

6.
The performance of online analytical processing (OLAP) is critical for meeting the increasing requirements of massive-volume analytical applications. Typical techniques, such as in-memory processing, column storage, and join indexes, focus on high-performance storage media, efficient storage models, and reduced query processing. While they execute OLAP applications effectively, there is a vital limitation: main-memory database based OLAP (MMOLAP) cannot provide high performance for large datasets. In this paper, we propose a novel memory dimension table model in which the primary keys of the dimension table map directly to dimensional tuple addresses. To achieve higher performance of dimensional tuple access, we optimize our storage model for dimension tables based on the features of OLAP query workloads. We present direct dimensional tuple access (DDTA) based join (DDTA-JOIN), a technique that optimizes query processing on the memory dimension table through direct dimensional tuple access. We also propose an optimization of the predicate tree that shortens predicate operations by pruning useless predicate processing. Our experimental results show that the DDTA-JOIN algorithm is superior to both simulated row-store main-memory query processing and the open-source column-store main-memory database MonetDB, thanks to the reduced join cost and simple yet efficient query processing.
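A minimal sketch of the direct-addressing idea: if a dimension table's surrogate keys are dense integers, a fact table's foreign key can be used directly as an array offset, so the "join" is a single indexed load instead of a hash probe. The table contents are made up for illustration.

```python
# Dimension table stored positionally: key i lives at dim_attrs[i].
dim_attrs = ["Asia", "Europe", "America"]           # dimension attribute column
fact_fk = [0, 2, 1, 0, 2]                           # fact-table foreign keys
fact_val = [10, 25, 7, 3, 12]                       # fact-table measure column

# Predicate on the dimension plus aggregation, resolved by direct addressing:
# each dimension tuple is fetched by offset, with no hash lookup.
total = sum(v for fk, v in zip(fact_fk, fact_val)
            if dim_attrs[fk] != "Europe")
print(total)   # 10 + 25 + 3 + 12 = 50
```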

7.
Wu Jimmy Ming-Tai, Wei Min, Wu Mu-En, Tayeb Shahab. The Journal of Supercomputing, 2022, 78(3): 3976-3997

A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some dimensions, so it is difficult to obtain useful information with traditional data mining methods designed for complete data; the BitMap Index Guided algorithm (BIG) is a good choice for this problem. However, finding top-k dominating objects on incomplete big data is even harder: when the dataset is very large, the feasibility and performance demands on the algorithm become severe. In this paper, we propose an algorithm that applies MapReduce to the whole process together with a pruning strategy, called the Efficient Hadoop BitMap Index Guided algorithm (EHBIG). EHBIG realizes TKD queries on incomplete datasets through a bitmap index and uses the MapReduce architecture to make TKD queries feasible on large datasets; the pruning strategy greatly reduces runtime and memory usage. Moreover, we propose an improved version of EHBIG (denoted IEHBIG) that optimizes the overall algorithm flow. Our experimental results clearly show that the proposed algorithms perform well on TKD queries over incomplete large datasets and perform strongly in a Hadoop computing cluster.
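A sketch of the bitmap-style dominance count underlying BIG/EHBIG, shown on complete data for clarity; the papers' handling of missing values and the MapReduce split are omitted. Python integers serve as bitsets: bit j of geq[d][i] says object i is at least as good as object j in dimension d (larger is better by assumption).

```python
def topk_dominating(data, k):
    n, dims = len(data), len(data[0])
    geq = [[0] * n for _ in range(dims)]
    for d in range(dims):
        for i in range(n):
            for j in range(n):
                if data[i][d] >= data[j][d]:
                    geq[d][i] |= 1 << j
    scores = []
    for i in range(n):
        ge_all = (1 << n) - 1                 # objects i is >= in every dim
        le_all = (1 << n) - 1                 # objects i is <= in every dim
        for d in range(dims):
            ge_all &= geq[d][i]
            le_mask = 0
            for j in range(n):
                if data[j][d] >= data[i][d]:
                    le_mask |= 1 << j
            le_all &= le_mask
        dominated = ge_all & ~le_all          # >= everywhere and > somewhere
        scores.append((bin(dominated).count("1"), i))
    return sorted(scores, reverse=True)[:k]

pts = [(5, 7), (3, 4), (5, 7), (1, 9), (2, 2)]
print(topk_dominating(pts, 2))   # (dominance count, object index) pairs
```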


8.
Since the beginning of the Semantic Web initiative, significant efforts have been invested in finding efficient ways to publish, store, and query metadata on the Web. RDF and SPARQL have become the standard data model and query language, respectively, for describing resources on the Web. Large amounts of RDF data are now available either as stand-alone datasets or as metadata over semi-structured (typically XML) documents. The ability to apply RDF annotations over XML data emphasizes the need to represent and query data and metadata simultaneously. We propose XR, a novel hybrid data model capturing the structural aspects of XML data and the semantics of RDF, which also enables reasoning about XML data. Our model is general enough to describe pure XML or RDF datasets, as well as RDF-annotated XML data, where any XML node can act as a resource. The data model comes with the XRQ query language, which combines features of XQuery and SPARQL. To demonstrate the feasibility of this hybrid XML-RDF data management setting, and to validate its interest, we have developed an XR platform on top of well-known data management systems for XML and RDF. In particular, the platform features several XRQ query processing algorithms, whose performance is experimentally compared.

9.
To address the difficulty of obtaining labeled samples for hyperspectral remote-sensing images, this work studies a multi-view active learning algorithm that selects a small number of high-quality query samples for interactive labeling. First, a bank of 3D Gabor filters with different scales and orientations extracts joint spatial-spectral features from the hyperspectral image; then, the 3D Gabor features with the strongest class-discrimination ability are selected to build multiple views; finally, a sample query strategy based on the minimum posterior probability difference across views (MPPD) is proposed. Starting from 30 labeled samples, after 100 iterations the 3D Gabor multi-view features combined with the MPPD query strategy achieve overall classification accuracies of 94.16% and 91.30% on the ROSIS Pavia University and AVIRIS Indiana Pines datasets, respectively. This shows that 3D Gabor filtering effectively extracts the spatial-spectral features of hyperspectral images and provides diverse, complementary feature views, and that combining it with the MPPD query strategy selects the most valuable query samples.
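The abstract leaves the exact MPPD criterion to the paper, so the sketch below encodes one plausible reading: train one probabilistic classifier per feature view and query the unlabeled samples whose posterior distributions differ least across views. Everything here (two logistic-regression views, the difference measure, the sample counts) is an assumption for illustration; the 3D Gabor feature extraction is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mppd_query(view1, view2, labels, labeled_idx, unlabeled_idx, n_query=5):
    """Select unlabeled samples with the smallest cross-view posterior difference."""
    clf1 = LogisticRegression(max_iter=1000).fit(view1[labeled_idx], labels[labeled_idx])
    clf2 = LogisticRegression(max_iter=1000).fit(view2[labeled_idx], labels[labeled_idx])
    p1 = clf1.predict_proba(view1[unlabeled_idx])
    p2 = clf2.predict_proba(view2[unlabeled_idx])
    diff = np.abs(p1 - p2).max(axis=1)       # per-sample posterior difference
    order = np.argsort(diff)                 # smallest difference first
    return [unlabeled_idx[i] for i in order[:n_query]]

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 8)), rng.normal(size=(100, 6))  # two feature views
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)
labeled, unlabeled = list(range(30)), list(range(30, 100))
print(mppd_query(X1, X2, y, labeled, unlabeled))
```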

10.
刘玉珍, 李楠, 陶志勇. 《图学学报》, 2022, 43(4): 616-623
Feature processing of point cloud data is a key component of 3D object recognition in robotics, autonomous driving, and other fields. To address redundant extraction of local features and weak recognition of an object's overall geometric structure, a point cloud classification and segmentation network based on ring queries and channel attention is proposed. First, single-layer ring queries are combined with a feature-channel attention mechanism to reduce local information redundancy and strengthen local features. Then, normal-vector variation is computed to identify high-response points on object edges and corners, and their normal features are added to the global feature representation to strengthen recognition of the overall geometric structure. Compared with a range of point cloud networks on the ModelNet40 and ShapeNet Part datasets, the network not only achieves high classification and segmentation accuracy, but also outperforms other methods in training time and memory footprint, and it is robust to varying numbers of input points. It is therefore an effective and practical network for point cloud classification and segmentation.

11.
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary datasets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, and other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction datasets illustrate fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
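A sketch of answering a boolean query from stored itemset frequencies with the inclusion-exclusion principle; the paper's ADtree is replaced by a plain dict for illustration. P(all positives are 1, all negatives are 0) expands as an alternating sum over subsets of the negated attributes.

```python
from itertools import combinations

def freq(itemset, table):
    """Frequency of 'all attributes in itemset equal 1' (empty set -> 1.0)."""
    return table[frozenset(itemset)]

def estimate(positives, negatives, table):
    """Inclusion-exclusion over subsets of the negated attributes."""
    total = 0.0
    for r in range(len(negatives) + 1):
        for subset in combinations(negatives, r):
            total += (-1) ** r * freq(set(positives) | set(subset), table)
    return total

# Toy frequency table over attributes A, B, C (would come from mined itemsets).
table = {frozenset(): 1.0,
         frozenset("A"): 0.5, frozenset("B"): 0.4, frozenset("C"): 0.3,
         frozenset("AB"): 0.25, frozenset("AC"): 0.15, frozenset("BC"): 0.1,
         frozenset("ABC"): 0.05}
# P(A=1, B=0, C=0) = P(A) - P(AB) - P(AC) + P(ABC)
print(estimate(positives={"A"}, negatives={"B", "C"}, table=table))  # 0.15
```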

12.
Federated SPARQL queries are executed according to a constructed query plan, and data-summary index files, which capture the structural and semantic information of an RDF dataset, are crucial for estimating sub-query cardinalities during plan generation. Existing methods for generating data summaries must remotely traverse the complete data of every source, which is costly, and in most environments a federated engine cannot complete such statistics gathering over large datasets. To reduce the generation time and memory overhead of data-summary index files while still capturing counts that are as faithful as possible, this work takes the skew of the subject and predicate distributions into account and proposes generating an approximate data summary of the original graph from a sample graph. A sampling method weighted by the out-degree of RDF graph nodes obtains a representative sample of the original graph, and an improved mapping function maps the information in the sample back onto the original graph to produce the approximate data summary. Experimental results show that, compared with the baseline, the method saves at least 70% of summary-generation time, and an approximate summary built by sampling only 0.5% of the original graph matches the baseline's query accuracy almost exactly.
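A hedged sketch of the sampling idea: pick subject nodes with probability weighted by out-degree, keep their triples as the sample graph, and scale the sampled predicate counts back up by the inverse sampling fraction as a stand-in for the paper's mapping function. The weighting, the without-replacement loop, and the scaling rule are all assumptions.

```python
import random
from collections import Counter, defaultdict

def sample_summary(triples, rate=0.005, seed=0):
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)
    subjects = list(by_subject)
    weights = [len(by_subject[s]) for s in subjects]         # out-degree weights
    k = max(1, int(rate * len(subjects)))
    chosen = set()
    while len(chosen) < k:                                   # weighted, no replacement
        chosen.add(rng.choices(subjects, weights=weights)[0])
    sampled = [t for s in chosen for t in by_subject[s]]
    scale = len(subjects) / len(chosen)                      # inverse sampling fraction
    counts = Counter(p for _, p, _ in sampled)
    return {p: round(c * scale) for p, c in counts.items()}  # approx. predicate counts

# Synthetic triples: 50 subjects, 5 predicates.
triples = [(f"s{i % 50}", f"p{i % 5}", f"o{i}") for i in range(1000)]
print(sample_summary(triples, rate=0.1))
```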

13.
For image retrieval, a method based on hash coding and convolutional neural networks is proposed. A hash layer is added to the CNN, and a coarse-to-fine hierarchical retrieval strategy is adopted: the learned hash codes drive a coarse search that returns a pool of m images identical or similar to the query, and a fine search then ranks the pool by the Euclidean distance between the high-level semantic features of the pooled images and those of the query. The hash-layer loss is taken as one of the optimization objectives, and retrieval combines the two kinds of image features, remedying the shortcoming of existing methods that retrieve directly with deep CNN features, which is time-consuming and memory-intensive. Experimental results on a printed-fabric dataset and CIFAR-10 show that the proposed method outperforms other existing methods.
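A sketch of the coarse-to-fine scheme described: filter by Hamming distance on binary hash codes to build a pool of m candidates, then rank the pool by Euclidean distance on the deep features. The codes and features below are random stand-ins; the CNN and hash layer are not reproduced.

```python
import numpy as np

def retrieve(query_code, query_feat, codes, feats, m=100, topk=10):
    hamming = np.count_nonzero(codes != query_code, axis=1)   # coarse stage
    pool = np.argsort(hamming)[:m]                            # candidate image pool
    d = np.linalg.norm(feats[pool] - query_feat, axis=1)      # fine stage
    return pool[np.argsort(d)[:topk]]

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(5000, 48))     # 48-bit hash codes
feats = rng.normal(size=(5000, 512))            # deep semantic features
print(retrieve(codes[7], feats[7], codes, feats, m=200, topk=5))  # index 7 first
```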

14.
Due to technological improvements, the number and volume of datasets are increasing considerably, bringing a need for additional memory and computational power. To work with massive datasets efficiently, feature selection, data reduction, rule-based, and exemplar-based methods have been introduced. This study presents a method for classifying massive datasets, which may be called joint generalized exemplar (JGE). The method aims to enhance the computational performance of NGE by countering the nesting and overlapping of hyper-rectangles: overlapping parts are repeatedly reassessed with the same procedure, and non-overlapping hyper-rectangle sections that fall within the same class are joined. This yields adaptive decision boundaries and allows batch data searching instead of incremental searching. Classification is then done according to the distance between each query and the generalized exemplars. The accuracy and time requirements of JGE, NGE, and other popular machine learning methods were compared on synthetic datasets and a benchmark dataset, and the results achieved by JGE were found acceptable.
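A sketch of the classification step only: each generalized exemplar is a labeled hyper-rectangle, the distance from a query to a rectangle is zero inside it and the clamped Euclidean gap outside, and the nearest exemplar's class wins. The joining and overlap-resolution phases of JGE are not reproduced.

```python
import math

def rect_distance(query, lower, upper):
    """Euclidean distance from a point to an axis-aligned hyper-rectangle."""
    gaps = (max(lo - q, 0.0, q - hi) for q, lo, hi in zip(query, lower, upper))
    return math.sqrt(sum(g * g for g in gaps))

def classify(query, exemplars):
    """exemplars: list of (lower, upper, label) hyper-rectangles."""
    return min(exemplars, key=lambda e: rect_distance(query, e[0], e[1]))[2]

exemplars = [((0, 0), (2, 2), "A"), ((3, 3), (5, 6), "B")]
print(classify((2.5, 2.4), exemplars))   # closer to the first rectangle -> "A"
```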

15.
MegaBlast is one of the most important programs in the NCBI BLAST (Basic Local Alignment Search Tool) toolkit. However, MegaBlast is computation and I/O intensive, and it consumes a great deal of memory, proportional to the product of the sizes of the query sequence set and the subject (database) sequence set. This paper proposes a new strategy for optimizing MegaBlast. The new strategy exchanges the query and subject sequence sets and builds the hash table on the new subject sequences. It overlaps I/O with computation, shortens the overall time, and reduces memory cost, since memory is now proportional only to the size of the subject sequence set. The optimized algorithm is well suited to parallelization on cluster systems; the parallel algorithm uses query segmentation. As our experiments show, the parallel program, implemented with MPI, has good scalability.
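A sketch of the reorganized lookup structure the optimization describes: build the k-mer hash table over the subject (database) sequences once, so memory scales with the subject set, then stream query sequences against it. The seed length and sequences are illustrative assumptions, and seed extension/scoring is omitted.

```python
from collections import defaultdict

def build_kmer_table(subjects, k=8):
    """Hash every k-mer of the subject sequences to its hit locations."""
    table = defaultdict(list)
    for sid, seq in subjects.items():
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((sid, pos))
    return table

def seed_hits(query, table, k=8):
    """Exact k-mer seed matches between a query and all subject sequences."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for sid, spos in table.get(query[qpos:qpos + k], ()):
            hits.append((sid, spos, qpos))
    return hits

subjects = {"chr1": "ACGTACGTGGACCTTAGC", "chr2": "TTGGACCTTAGCACGT"}
table = build_kmer_table(subjects)
print(seed_hits("GGACCTTA", table))   # seeds found in both subjects
```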

16.
In a multi-query environment, the marginal utilities of allocating additional buffer memory to the various queries can be vastly different. The conventional approach examines each query in isolation to determine the optimal access plan and the corresponding locality set, which can lead to performance far from optimal. Because each query can have different access plans with dissimilar locality sets and sensitivities to memory, we employ the concepts of memory consumption and return on consumption (ROC) as the basis for memory allocation. The memory consumption of a query is its space-time product, while ROC measures the effectiveness of reducing response time through additional memory consumption. A global optimization strategy using simulated annealing is developed that minimizes the average response time over all queries under the constraint that the total memory consumption rate must be less than the buffer size; it selects the optimal join method and memory allocation for all query types simultaneously. By analyzing how the optimal strategy allocates memory, a heuristic threshold strategy is then proposed. The threshold strategy is based on ROC: since the memory consumption rate of all queries is limited by the buffer size, the strategy allocates memory so as to ensure that a certain level of ROC is achieved. A simulation model demonstrates that the heuristic strategy yields performance very close to the optimal strategy and far superior to the conventional allocation strategy.
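A toy sketch of the return-on-consumption idea: repeatedly give the next memory increment to the query whose response-time reduction per unit of added memory is highest, stopping when no query clears an ROC threshold or the buffer is exhausted. The response-time curves, step size, and threshold are made-up inputs, not the paper's model.

```python
def allocate(buffer_pages, queries, step=8, roc_threshold=0.01):
    """queries: {name: resp_time(pages)}; returns pages granted per query."""
    grant = {q: step for q in queries}               # minimal grant to start
    free = buffer_pages - step * len(queries)
    while free >= step:
        best, best_roc = None, roc_threshold
        for q, rt in queries.items():
            roc = (rt(grant[q]) - rt(grant[q] + step)) / step
            if roc > best_roc:
                best, best_roc = q, roc
        if best is None:                             # no query clears the ROC bar
            break
        grant[best] += step
        free -= step
    return grant

# Hypothetical response-time curves: one query flattens out quickly, the
# other keeps benefiting from extra memory.
curves = {"Q1": lambda m: 100 / m + 2, "Q2": lambda m: 400 / m + 5}
print(allocate(256, curves))
```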

17.
The rapid growth of spatial datasets requires methods for finding spatial knowledge in them (semi-)automatically. Spatial collocation patterns represent subsets of spatial features whose instances are frequently located together in a spatial neighborhood. Efficient methods for collocation discovery have been developed in recent years, but none of them assumes a limited amount of operational memory or limited access to memory with short access times. Such restrictions are especially important given the large data structures required for efficient identification of collocation instances. In this work we present and compare three algorithms for collocation pattern mining in a limited-memory environment. The first is based on the well-known joinless method introduced by Shekhar and Yoo; the second and third are inspired by a tree structure (the iCPI-tree) presented by Wang et al. In our experimental evaluation, we compare the efficiency of the algorithms on both synthetic and real-world datasets.

18.
As an emerging network structure, the capsule network uses vector outputs instead of scalar outputs, which captures spatial relationships between image features and mitigates limitations of convolutional neural networks. This paper first trains a capsule network for image classification, using the predicted label to determine the category of the query image, and then uses the feature parameters in the network's digit-capsule layer as the image's feature vector. The feature vector is used to find images similar to the query image within the query image's category set. Experiments on the FASHION-MNIST and CIFAR-10 datasets show that the proposed method extracts image features well and obtains good retrieval results.

19.
A New Method for Strong Closest-Pair Queries
Strong closest-pair queries over the data points of a dataset are important in spatial databases, multimedia databases, and related fields. For two datasets with similar sizes and similar point distributions, a dual Voronoi diagram method is designed to handle strong closest-pair queries in an obstacle-free environment. For environments with obstacles, based on how point pairs are blocked, the concept of a filtering region is introduced and point pairs are handled case by case, which narrows the candidate range and eliminates a large amount of redundant computation. Theoretical analysis and experiments show that the method applies to a wider range of settings and significantly improves the efficiency of strong closest-pair queries.

20.
A k-step reachability query answers whether a given directed acyclic graph (DAG) contains a path of length at most k between two vertices. To address the large index sizes and slow query processing of existing methods, this work proposes a bidirectional shortest-path index built over a subset of the vertices to raise the index's coverage of reachability information, together with a set of optimization rules that shrink the index; it then proposes a pair of mutually inverse (forward and reverse) topological indexes over a simplified graph to answer non-reachable queries faster, and a far-distance-first bidirectional traversal strategy to speed up query processing. Experiments on 21 real datasets (citation networks, social networks, etc.) show that, compared with the efficient existing methods PLL and BFSI-B, the proposed algorithm has smaller indexes and faster query response.
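A baseline sketch of the query itself, without the paper's indexes or far-distance-first strategy: answer "is there a path of length at most k from u to v" with a bidirectional breadth-first search that expands at most floor(k/2) steps forward from u and the remaining steps backward from v.

```python
from collections import deque

def bfs_levels(adj, src, depth):
    """Vertices reachable from src within 'depth' hops, with hop counts."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] == depth:
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def k_reach(adj, radj, u, v, k):
    """adj: forward adjacency lists; radj: reverse adjacency lists."""
    fwd = bfs_levels(adj, u, k // 2)
    bwd = bfs_levels(radj, v, k - k // 2)
    return any(x in bwd and fwd[x] + bwd[x] <= k for x in fwd)

adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
radj = {"b": ["a"], "c": ["b"], "d": ["c"], "a": []}
print(k_reach(adj, radj, "a", "d", 3), k_reach(adj, radj, "a", "d", 2))  # True False
```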
