首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 640 毫秒
1.
The architectural choices underlying Linked Data have led to a compendium of data sources which contain both duplicated and fragmented information on a large number of domains. One way to enable non-experts users to access this data compendium is to provide keyword search frameworks that can capitalize on the inherent characteristics of Linked Data. Developing such systems is challenging for three main reasons. First, resources across different datasets or even within the same dataset can be homonyms. Second, different datasets employ heterogeneous schemas and each one may only contain a part of the answer for a certain user query. Finally, constructing a federated formal query from keywords across different datasets requires exploiting links between the different datasets on both the schema and instance levels. We present Sina, a scalable keyword search system that can answer user queries by transforming user-supplied keywords or natural-languages queries into conjunctive SPARQL queries over a set of interlinked data sources. Sina uses a hidden Markov model to determine the most suitable resources for a user-supplied query from different datasets. Moreover, our framework is able to construct federated queries by using the disambiguated resources and leveraging the link structure underlying the datasets to query. We evaluate Sina over three different datasets. We can answer 25 queries from the QALD-1 correctly. Moreover, we perform as well as the best question answering system from the QALD-3 competition by answering 32 questions correctly while also being able to answer queries on distributed sources. We study the runtime of SINA in its mono-core and parallel implementations and draw preliminary conclusions on the scalability of keyword search on Linked Data.  相似文献   

2.
何龙  陈晋川  杜小勇 《软件学报》2017,28(3):502-513
SOH(SQL over HDFS)系统通常将数据存储于分布式文件系统HDFS中,采用Map/Reduce或分布式查询引擎来处理查询任务。得益于HDFS以及Map/Reduce的容错能力和可扩展性,SOH系统可以很好地应对数据规模的飞速增长,完成分析型查询处理。然而,在处理选择型查询或交互式查询时,这类系统暴露出性能上的缺陷。本文提出一个通用的索引技术,可以应用于SOH系统中,以提高其查询处理的效率。分析了SOH系统访问HDFS文件的过程,指出了其中影响数据加载时间的关键因素;提出了split层和split内部双层索引机制;设计并实现了聚集索引和非聚集索引。最后,在标准数据集上进行了大量实验,并与现有基于HDFS的索引技术进行了比较。实验结果表明,所提出的索引技术可以有效地提高查询处理的效率。  相似文献   

3.
On-line analytical processing (OLAP) typically involves complex aggregate queries over large datasets. The data cube has been proposed as a structure that materializes the results of such queries in order to accelerate OLAP. A significant fraction of the related work has been on Relational-OLAP (ROLAP) techniques, which are based on relational technology. Existing ROLAP cubing solutions mainly focus on “flat” datasets, which do not include hierarchies in their dimensions. Nevertheless, as shown in this paper, the nature of hierarchies introduces several complications into the entire lifecycle of a data cube including the operations of construction, storage, indexing, query processing, and incremental maintenance. This fact renders existing techniques essentially inapplicable in a significant number of real-world applications and mandates revisiting the entire cube lifecycle under the new perspective. In order to overcome this problem, the CURE algorithm has been recently proposed as an efficient mechanism to construct complete cubes over large datasets with arbitrary hierarchies and store them in a highly compressed format, compatible with the relational model. In this paper, we study the remaining phases in the cube lifecycle and introduce query-processing and incremental-maintenance algorithms for CURE cubes. These are significantly different from earlier approaches, which have been proposed for flat cubes constructed by other techniques and are inadequate for CURE due to its high compression rate and the presence of hierarchies. Our methods address issues such as cube indexing, query optimization, and lazy update policies. Especially regarding updates, such lazy approaches are applied for the first time on cubes. We demonstrate the effectiveness of CURE in all phases of the cube lifecycle through experiments on both real-world and synthetic datasets. Among the experimental results, we distinguish those that have made CURE the first ROLAP technique to complete the construction and usage of the cube of the highest-density dataset in the APB-1 benchmark (12 GB). CURE was in fact quite efficient on this, showing great promise with respect to the potential of the technique overall.  相似文献   

4.
We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL query as a query graph. Thus, the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.  相似文献   

5.
6.
As RDF data continue to gain popularity, we witness the fast growing trend of RDF datasets in both the number of RDF repositories and the size of RDF datasets. Many known RDF datasets contain billions of RDF triples (subject, predicate and object). One of the grant challenges for managing these huge RDF data is how to execute RDF queries efficiently. In this paper, we address the query processing problems against the billion triple challenges. We first identify some causes for the problems of existing query optimization schemes, such as large intermediate results, initial query cost estimation errors. Then, we present our block-oriented dynamic query plan generation approach powered with pipelining execution. Our approach consists of two phases. In the first phase, a near-optimal execution plan for queries is chosen by identifying the processing blocks of queries. We group the join patterns sharing a join variable into building blocks of the query plan since executing them first provides opportunities to reduce the size of intermediate results generated. In the second phase, we further optimize the initial pipelining for a given query plan. We employ optimization techniques, such as sideways information passing and semi-join, to further reduce the size of intermediate results, improve the query processing cost estimation and speed up the performance of query execution. Experimental results on several RDF datasets of over a billion triples demonstrate that our approach outperforms existing RDF query engines that rely on dynamic programming based static query processing strategies.  相似文献   

7.
Information search and retrieval from a remote database (e.g., cloud server) involves a multitude of privacy issues. Submitted search terms and their frequencies, returned responses and order of their relevance, and retrieved data items may contain sensitive information about the users. In this paper, we propose an efficient multi-keyword search scheme that ensures users’ privacy against both external adversaries including other authorized users and cloud server itself. The proposed scheme uses cryptographic techniques as well as query and response randomization. Provided that the security and randomization parameters are appropriately chosen, both search terms in queries and returned responses are protected against privacy violations. The scheme implements strict security and privacy requirements that essentially disallow linking queries featuring identical search terms. We also incorporate an effective ranking capability in the scheme that enables user to retrieve only the top matching results. Our comprehensive analytical study and extensive experiments using both real and synthetic datasets demonstrate that the proposed scheme is privacy-preserving, effective, and highly efficient.  相似文献   

8.
With the advent of the era of cloud computing and big data, in order to cope with vast amounts of data, a number of key-value databases have emerged. These systems provide the ability of large scale data storage and effective data operations based on primary keys, but they do not efficiently support the range and k-Nearest Neighbor (kNN) queries on multi-dimensional datasets. In this paper, we introduce, SPIKE, a sliced Pyramid-based index system for key-value data stores. SPIKE bridges the gap between the data scale and querying functionality for highly available, scalable distributed key-value data stores. We first present SP-Index, the kernel indexing scheme. The SP-Index is designed as a two-level index mechanism consisting of a sliced pyramid space partition index and a distributed B-Tree index. On the basis of SP-Index, we have designed and implemented SPIKE on Cassandra, which provides efficient multi-dimensional complex query processing. We have conducted a set of comprehensive experiments with three types of datasets including synthetic datasets, TPC-H benchmark datasets and a real-world dataset. The experiment results show that SPIKE can efficiently handle multi-dimensional complex queries on large-scale key-value datasets. Evaluation results in comparison with existing systems demonstrates that SPIKE outperforms the comparing work including the original Pyramid, MySQL Cluster and CCIndex by dozens of times in complex query processing.  相似文献   

9.
10.
We present a high level query language, called HIFUN, for defining analytic queries over big datasets, independently of how these queries are evaluated. An analytic query in HIFUN is defined to be a well-formed expression of a functional algebra that we define in the paper. The operations of this algebra combine functions to create HIFUN queries in much the same way as the operations of the relational algebra combine relations to create algebraic queries. The contributions of this paper are: (a) the definition of a formal framework in which to study analytic queries in the abstract; (b) the encoding of a HIFUN query either as a MapReduce job or as an SQL group-by query; and (c) the definition of a formal method for rewriting HIFUN queries and, as a case study, its application to the rewriting of MapReduce jobs and of SQL group-by queries. We emphasize that, although theoretical in nature, our work uses only basic and well known mathematical concepts, namely functions and their basic operations.  相似文献   

11.
Given a set S of sites and a set O of weighted objects, an optimal location query finds the location(s) where introducing a new site maximizes the total weight of the objects that are closer to the new site than to any other site. With such a query, for instance, a franchise corporation (e.g., McDonald’s) can find a location to open a new store such that the number of potential store customers (i.e., people living close to the store) is maximized. Optimal location queries are computationally complex to compute and require efficient solutions that scale with large datasets. Previously, two specific approaches have been proposed for efficient computation of optimal location queries. However, they both assume p-norm distance (namely, L1 and L2/Euclidean); hence, they are not applicable where sites and objects are located on spatial networks. In this article, we focus on optimal network location (ONL) queries, i.e., optimal location queries in which objects and sites reside on a spatial network. We introduce two complementary approaches, namely EONL (short for Expansion-based ONL) and BONL (short for Bound-based ONL), which enable efficient computation of ONL queries with datasets of uniform and skewed distributions, respectively. Moreover, with an extensive experimental study we verify and compare the efficiency of our proposed approaches with real world datasets, and we demonstrate the importance of considering network distance (rather than p-norm distance) with ONL queries.  相似文献   

12.
时间序列数据在能源、制造、金融、气候等领域有着广泛应用,聚合查询是相关分析场景中常见的查询需求,快速获取海量数据的概要信息,对于提高数据分析工作的效率具有重要意义.通过存储元数据加速聚合查询是一种有效的提升聚合查询执行效率的手段,但现有的时间序列数据库都使用时间窗口切分数据,需要对数据进行实时排序和分区,难以适应物联网场景下高并发、大吞吐量的数据写入特点.因此,提出了一种面向聚合查询的ApacheIoTDB物理元数据管理方案.该方案按照数据文件的物理存储特性切分数据,并结合同步计算和异步计算策略,优先保证数据的写入性能.针对时间序列数据中普遍存在的乱序数据,将时间范围重叠的一组文件抽象为乱序文件组并提供元数据,聚合查询会被重写为3个结合物理元数据和原始数据的子查询高效执行.多个数据集上的实验验证了该方案对聚合查询执行效率的提升效果以及不同计算策略对性能的影响.  相似文献   

13.
在使用"不完全结构的约束查询(PSTP查询)"从XML文档中获取信息时,用户可以根据自身对XML文档结构的熟悉程度,在查询表达式中灵活地嵌入结构约束条件,从而满足完全不了解、完全了解及了解部分结构信息的各种用户的查询需求。提出一种基于扩展Dewey编码的查询处理算法,可以在仅扫描一遍元素的情况下,处理任意形式的PSTP查询。不同数据集上的实验结果表明,EDPS算法在处理twig查询、不包含"*"结点的PSTP查询及包含"*"结点的PSTP查询时,综合性能明显优于已有方法。  相似文献   

14.
The performance optimization of query processing in spatial networks focuses on minimizing network data accesses and the cost of network distance calculations. This paper proposes algorithms for network k-NN queries, range queries, closest-pair queries and multi-source skyline queries based on a novel processing framework, namely, incremental lower bound constraint. By giving high processing priority to the query associated data points and utilizing the incremental nature of the lower bound, the performance of our algorithms is better optimized in contrast to the corresponding algorithms based on known framework incremental Euclidean restriction and incremental network expansion. More importantly, the proposed algorithms are proven to be instance optimal among classes of algorithms. Through experiments on real road network datasets, the superiority of the proposed algorithms is demonstrated.  相似文献   

15.
Huang  Jinjing  Chen  Wei  Liu  An  Wang  Weiqing  Yin  Hongzhi  Zhao  Lei 《World Wide Web》2020,23(2):755-779

A temporal knowledge graph (TKG) is theoretically a temporal graph. Recently, systems have been developed to support snapshot queries over temporal graphs. However, snapshot queries can only give separate answers. To retrieve forward-backward correlation facts from temporal knowledge graph, cluster query is proposed in this paper. To deal with the query, the logical view and physical model are presented. Subsequently, five corresponding basic query patters of unit matching are studied, and then the complete matchings are also addressed. To improve the query performance, index-based methods and pruning strategies are adopted. Experiments are conducted to evaluate cluster queries on three real datasets. The experimental results show the effectiveness and efficiency of cluster queries on temporal knowledge graphs.

  相似文献   

16.
扩展的锥形方向关系查询处理方法   总被引:1,自引:0,他引:1       下载免费PDF全文
通过加入距离约束,扩展锥形方向关系的描述方式,提出新的查询处理方法——扩展锥形方向二叉树,该方法能处理方向空间连接的查询过滤,通过组合方向和距离关系,提高定性推理的准确性。与传统基于索引的方法相比,该方法能够有效处理大数据集中任意对象间方向关系的查询和定性推理,实现简单、查询效率和推理准确性较高。  相似文献   

17.
Top-k查询是Web和多媒体搜索、决策支持、分布式系统等众多领域中最重要的查询之一,它返回数据集合中k个最关键的元组.大型数据集合往往包含一系列分类型属性,获取对目标属性影响最大的k个分类型属性值对于许多应用中也非常重要.研究了这个问题,正式定义了k-AKC和PKC两种查询,并设计相应的查询处理算法.实验结果表明,改良算法PKCQ+具有较佳的有效性和高效性.  相似文献   

18.
19.
李培  翁伟  林琛 《中文信息学报》2016,30(3):143-151
新浪微博、腾讯微博等微博平台已经成为国内重要的网络媒体。随着海量的实时信息在微博上分享和传播,为每个用户提供更多方便,展现一目了然的实事资讯的任务已经迫在眉睫。这就需要在微博中理出重大事件的发展进程。该文中,我们将利用最小权重支配集和有向斯坦纳树在给定查询的微博数据集上生成故事线。该文的工作由三部分组成:第一部分是在Lucene检索出来的结果集上构建多视点图;其次,通过在图中寻找最小权重支配集来选出具有代表性的微博;最后,通过求解有向斯坦纳树问题来平滑地连接这些已挑选的微博,形成故事线。在实际数据集上的实验验证了该文提出系统的高效性和有效性。
  相似文献   

20.
With the ever-growing popularity of smartphone devices in recent years, skyline queries over spatial Web objects in road networks have received increasing attention. In the literature, various techniques have been developed to tackle skyline queries that take both spatial and non-spatial attributes into consideration. However, the existing solutions only focus on solving point-based queries, where the query location is a spatial point. We observe that in many real-life applications, the user location is often represented by a spatial range. Thus, in this paper, we study a new problem of range-based skyline queries (CRSQs) in road networks. Two efficient algorithms named landmark-based (LBA) and index-based (IBA) algorithms are proposed. We also present incremental versions of LBA and IBA to handle continuous range-based skyline queries over moving objects. Extensive experiments using real road network datasets demonstrate the effectiveness and efficiency of our proposed algorithms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号