首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
The database community has devoted extensive amount of efforts to indexing and querying temporal data in the past decades. However, insufficient amount of attention has been paid to temporal ranking queries. More precisely, given any time instance t, the query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, the performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely, O(logB \fracNB+\frackB){O\left({\rm log}_B\,\frac{N}{B}+\frac{k}{B}\right)} I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable k max values, where k max is the maximum value for the possible values of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity as it uses the B-tree as the only building block. Extensive experiments on a number of large data sets, show that the Seb-tree is more than an order of magnitude faster than the R-tree based indexes for temporal ranking queries.  相似文献   

2.
Searching XML data using keyword queries has attracted much attention because it enables Web users to easily access XML data without having to learn a structured query language or study possibly complex data schemas. Most of the current approaches identify the meaningful results of a given keyword query based on the semantics of lowest common ancestor (LCA) and its variants. However, given the fact that LCA candidates are usually numerous and of low relevance to the users?? information need, how to effectively and efficiently identify the most relevant results from a large number of LCA candidates is still a challenging and unresolved issue. In this article, we introduce a novel semantics of relevant results based on mutual information between the query keywords. Then, we introduce a novel approach for identifying the relevant answers of a given query by adopting skyline semantics. We also recommend three different ranking criteria for selecting the top-k relevant results of the query. Efficient algorithms are proposed which rely on some provable properties of the dominance relationship between result candidates to rapidly identify the top-k dominant results. Extensive experiments were conducted to evaluate our approach and the results show that the proposed approach has a good performance compared with other existing approaches in different data sets and evaluation metrics  相似文献   

3.
Wu  Jimmy Ming-Tai  Wei  Min  Wu  Mu-En  Tayeb  Shahab 《The Journal of supercomputing》2022,78(3):3976-3997

Top-k dominating (TKD) query is one of the methods to find the interesting objects by returning the k objects that dominate other objects in a given dataset. Incomplete datasets have missing values in uncertain dimensions, so it is difficult to obtain useful information with traditional data mining methods on complete data. BitMap Index Guided Algorithm (BIG) is a good choice for solving this problem. However, it is even harder to find top-k dominance objects on incomplete big data. When the dataset is too large, the requirements for the feasibility and performance of the algorithm will become very high. In this paper, we proposed an algorithm to apply MapReduce on the whole process with a pruning strategy, called Efficient Hadoop BitMap Index Guided Algorithm (EHBIG). This algorithm can realize TKD query on incomplete datasets through BitMap Index and use MapReduce architecture to make TKD query possible on large datasets. By using the pruning strategy, the runtime and memory usage are greatly reduced. What’s more, we also proposed an improved version of EHBIG (denoted as IEHBIG) which optimizes the whole algorithm flow. Our in-depth work in this article culminates with some experimental results that clearly show that our proposed algorithm can perform well on TKD query in an incomplete large dataset and shows great performance in a Hadoop computing cluster.

  相似文献   

4.
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. Three types of top-k typicality queries are formulated. To answer questions like “Who are the top-k most typical NBA players?”, the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Moreover, to answer questions like “Who are the best k typical guards in whole representing different types of guards?”, the notion of representative typicality is used. Computing the exact answer to a top-k typicality query requires quadratic time which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity though it does not provide a theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; (3) a local typicality tree data structure can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.  相似文献   

5.
Reducing network traffic in unstructured P2P systems using Top-k queries   总被引:1,自引:0,他引:1  
A major problem of unstructured P2P systems is their heavy network traffic. This is caused mainly by high numbers of query answers, many of which are irrelevant for users. One solution to this problem is to use Top-k queries whereby the user can specify a limited number (k) of the most relevant answers. In this paper, we present FD, a (Fully Distributed) framework for executing Top-k queries in unstructured P2P systems, with the objective of reducing network traffic. FD consists of a family of algorithms that are simple but effective. FD is completely distributed, does not depend on the existence of certain peers, and addresses the volatility of peers during query execution. We validated FD through implementation over a 64-node cluster and simulation using the BRITE topology generator and SimJava. Our performance evaluation shows that FD can achieve major performance gains in terms of communication and response time. Recommended by: Sunil Prabhakar Work partially funded by the ARA Massive Data of the Agence Nationale de la Recherche.  相似文献   

6.
近年来,分布式系统中的数据流监测是一个十分活跃的领域。研究了如何实现通用并且高效的分布式top-k监测,即在分布的多数据流中根据用户给定的排序函数连续监测最大的k个值。在实际应用中,用户给定的排序函数可能是任意的排序函数,然而,目前的分布式top-k监测技术只支持加法作为排序函数。提出了一种通用的支持任意的连续的严格单调的聚集函数的分布式top-k监测算法GMR。GMR的通讯代价和k无关。通过真实世界数据和模拟数据验证了GMR的效率。实验表明,GMR的网络通讯量比同类方法低一个数量级以上。  相似文献   

7.
Given the heterogeneity of complex graph data on the web, such as RDF linked data, it is likely that a user wishing to query such data will lack full knowledge of the structure of the data and of its irregularities. Hence, providing flexible querying capabilities that assist users in formulating their information seeking requirements is highly desirable. In this paper we undertake a detailed theoretical investigation of query approximation, query relaxation, and their combination, for this purpose. The query language we adopt comprises conjunctions of regular path queries, thus encompassing recent extensions to SPARQL to allow for querying paths in graphs using regular expressions (SPARQL 1.1). To this language we add standard notions of query approximation based on edit distance, as well as query relaxation based on RDFS inference rules. We show how both of these notions can be integrated into a single theoretical framework and we provide incremental evaluation algorithms that run in polynomial time in the size of the query and the data, returning answers in ranked order of their ‘distance’ from the original query. We also combine for the first time these two disparate notions into a single ‘flex’ operation that simultaneously applies both approximation and relaxation to a query conjunct, providing even greater flexibility for users, but still retaining polynomial time evaluation complexity and the ability to return query answers in ranked order.  相似文献   

8.
Web users often post queries through form-based interfaces on the Web to retrieve data from the Web; however, answers to these queries are mostly computed according to keywords entered into different fields specified in a query interface, and their precision and recall could be low. The precision and recall ratios in answering this type of query can be improved by considering closely related previous queries submitted through the same interface, along with their answers. In this paper, we present an approach for enhancing the retrieval of relevant answers to a form-based Web query by adopting the data-mining approach using previous, relevant queries and their answers. Experimental results on a randomly selected set of 3,800 documents retrieved from various Web sites show that our data-mining, query-rewriting approach achieves average precision and true positive ratios on rewritten queries in the upper 80% range, whereas the average false positive ratio is less than 2.0%. Work partially done during a visit to BYU and partially supported by National Natural Science Foundation of China No. 60503036 and Fok YingTong Education Foundation No. 104027.  相似文献   

9.
Optimizing taxonomic semantic web queries using labeling schemes   总被引:1,自引:0,他引:1  
This paper focuses on the optimization of the navigation through voluminous subsumption hierarchies of topics employed by portal catalogs like Netscape Open Directory (ODP). We advocate for the use of labeling schemes for modeling these hierarchies in order to efficiently answer queries such as subsumption check, descendants, ancestors or nearest common ancestor, which usually require costly transitive closure computations. We first give a qualitative comparison of three main families of schemes, namely bit vector, prefix and interval based schemes. We then show that two labeling schemes are good candidates for an efficient implementation of label querying using standard relational DBMS, namely the Dewey prefix scheme and an interval scheme by Agrawal et al. We compare their storage and query evaluation performance for the 16 ODP hierarchies using the PostgreSQL engine.  相似文献   

10.
Efficient queries over Web views   总被引:1,自引:0,他引:1  
Large Web sites are becoming repositories of structured information that can benefit from being viewed and queried as relational databases. However, querying these views efficiently requires new techniques. Data usually resides at a remote site and is organized as a set of related HTML documents, with network access being a primary cost factor in query evaluation. This cost can be reduced by exploiting the redundancy often found in site design. We use a simple data model, a subset of the Araneus data model, to describe the structure of a Web site. We augment the model with link and inclusion constraints that capture the redundancies in the site. We map relational views of a site to a navigational algebra and show how to use the constraints to rewrite algebraic expressions, reducing the number of network accesses. We show that similar techniques can be used to maintain materialized views over sets of HTML pages.  相似文献   

11.
Traditional testing techniques are not adequate for web-based applications, since they miss their additional features such as their multi-tier nature, hyperlink-based structure, and event-driven feature. Limited work has been done on testing web applications. In this paper, we propose new techniques for white box testing of web applications developed in the .NET environment with emphasis on their event-driven feature. We extend recent work on modeling of web applications by enhancing previous dependence graphs and proposing an event-based dependence graph model. We apply data flow testing techniques to these dependence graphs and propose an event flow testing technique. Also, we present a few coverage testing approaches for web applications. Further, we propose mutation testing operators for evaluating the adequacy of web application tests.  相似文献   

12.
This paper describes a process for mashing heterogeneous data sources based on the Multi-data source Fusion Approach (MFA) (Nachouki and Quafafou, 2008 [52]). The aim of MFA is to facilitate the fusion of heterogeneous data sources in dynamic contexts such as the Web. Data sources are either static or active: static data sources can be structured or semi-structured (e.g. XML documents or databases), whereas active sources are services (e.g. Web services). Our main objective is to combine (Web) data sources with a minimal effort required from the user. This objective is crucial because the mashing process implies easy and fast integration of data sources. We suppose that the user is not expert in this field but he/she understands the meaning of data being integrated. In this paper, we consider two important aspects of the Web mashing process. The first one concerns the information extraction from the Web. The results of this process are the static data sources that are used later together with services in order to create a new result/application. The second one concerns the problem of semantic reconciliation of data sources. This step consists to generate the Conflicts data source in order to improve the problem of rewriting semantic queries into sub-queries (not addressed in this paper) over data sources. We give the design of our system MDSManager. We show this process through a real-life application.  相似文献   

13.
为了解决连续不确定XML高效的top-k查询,提出CProTJFast算法.该算法基于P-文档模型,扩展PEDewey(probabilistic extended Dewey)编码支持连续分布类型节点的编码,采用路径概率下限值进行节点过滤,并针对连续概率密度函数制定过滤策略,从而在计算连续节点概率之前过滤掉不参与结果的节点.实验结果表明,采用连续节点过滤策略的CProTJFast算法有效地提高了连续不确定XML的top-k查询效率.  相似文献   

14.
孙永佼  袁野  王国仁 《计算机学报》2011,34(11):2155-2164
分布式环境中的top-k查询已经有了广泛的研究.由于仪器不精确和网络延时等原因,大多数分布式数据都存在不确定性.文中基于水平分布在P2P网络中的不确定数据提出了一个有效的top-k查询处理方法.首先利用Quad-tree构建一个分布式的不确定数据的索引,并基于索引提出了一个空间剪枝算法.然后,根据局部top-k概率与全...  相似文献   

15.
16.
We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a “partial evaluation and assembly” framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial match as partial answers in each fragment of RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms from both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories of billions of triples confirm that our method is superior to the state-of-the-art methods in both the system’s performance and scalability.  相似文献   

17.
Optimizing top-k selection queries over multimedia repositories   总被引:2,自引:0,他引:2  
Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A query on these attributes will typically, request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). Furthermore, unlike in the relational model, users may just want the k top-ranked objects for their selection queries for a relatively small k. In addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of top-k selection queries over multimedia repositories. The access characteristics of the repositories and the above query model lead to novel issues in query optimization. In particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we present an efficient algorithm that solves the problem optimally with respect to our cost model and execution space when the predicates in the query are independent. We also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. Thus, both problems can be viewed together as an extended filtering problem to which techniques of query processing and optimization may be adapted.  相似文献   

18.
Automatically identifying the user intent behind web queries has started to catch the attention of the research community, since it allows search engines to enhance user experience by adapting results to that goal. It is broadly agreed that there are three archetypal intentions behind search queries: navigational, resource/transactional and informational.Thus, as a natural consequence, this task has been interpreted as a multi-class classification problem. At large, recent works have focused on comparing several machine learning methods built with words as features. Conversely, this paper examines the influence of assorted properties on three classification approaches. In particular, it focuses its attention on the contribution of linguistic-based attributes. However, most of natural language processing tools are designed for documents, not web queries. Therefore, as a means of bridging this linguistic gap, we benefited from caseless models, which are trained with traditionally labeled data, but all terms are converted to lowercase before their generation.Overall, tested attributes proved to be effective by improving on word-based classifiers by up to 8.347% (accuracy), and outperforming a baseline by up to 6.17%. Most notably, linguistic-oriented features, from caseless models, are shown to be instrumental in narrowing the linguistic gap between queries and documents.  相似文献   

19.
A survey of queries over uncertain data   总被引:1,自引:1,他引:0  
Uncertain data have already widely existed in many practical applications recently, such as sensor networks, RFID networks, location-based services, and mobile object management. Query processing over uncertain data as an important aspect of uncertain data management has received increasing attention in the field of database. Uncertain query processing poses inherent challenges and demands non-traditional techniques, due to the data uncertainty. This paper surveys this interesting and still evolving research area in current database community, so that readers can easily obtain an overview of the state-of-the-art techniques. We first provide an overview of data uncertainty, including uncertainty types, probability representation models, and sources of probabilities. We next outline the current major types of uncertain queries and summarize the main features of uncertain queries. Particularly, we present and analyze several typical uncertain queries in detail, such as skyline queries, top- $k$ queries, nearest-neighbor queries, aggregate queries, join queries, range queries, and threshold queries over uncertain data. Finally, we present many interesting research topics on uncertain queries that have not yet been explored.  相似文献   

20.
Secure multidimensional range queries over outsourced data   总被引:1,自引:0,他引:1  
In this paper, we study the problem of supporting multidimensional range queries on encrypted data. The problem is motivated by secure data outsourcing applications where a client may store his/her data on a remote server in encrypted form and want to execute queries using server??s computational capabilities. The solution approach is to compute a secure indexing tag of the data by applying bucketization (a generic form of data partitioning) which prevents the server from learning exact values but still allows it to check if a record satisfies the query predicate. Queries are evaluated in an approximate manner where the returned set of records may contain some false positives. These records then need to be weeded out by the client which comprises the computational overhead of our scheme. We develop a bucketization procedure for answering multidimensional range queries on multidimensional data. For a given bucketization scheme, we derive cost and disclosure-risk metrics that estimate client??s computational overhead and disclosure risk respectively. Given a multidimensional dataset, its bucketization is posed as an optimization problem where the goal is to minimize the risk of disclosure while keeping query cost (client??s computational overhead) below a certain user-specified threshold value. We provide a tunable data bucketization algorithm that allows the data owner to control the trade-off between disclosure risk and cost. We also study the trade-off characteristics through an extensive set of experiments on real and synthetic data.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号