20 similar documents found (search time: 31 ms)
Conventional approaches to finding related search engine queries rely on the terms shared by two queries to measure their relatedness. However, search engine queries are usually short, and the term overlap between two queries is very small, so using query terms as a feature space cannot accurately estimate relatedness. Alternative feature spaces are needed to enrich term-based search queries. In this paper, given a search query, we first extract the Web pages accessed by users from Japanese Web access logs, which store the users' individual and collective behavior. From these accessed Web pages we can usually obtain two kinds of feature spaces, i.e., content-sensitive (e.g., nouns) and content-ignorant (e.g., URLs), to enrich the expressions of search queries. The relatedness between search queries can then be estimated on their enriched expressions. Our experimental results show that the URL feature space produces much lower precision scores than the noun feature space, which, however, is not applicable to non-text pages, dynamic pages, and so on. It is crucial to improve the quality of the URL (content-ignorant) feature space, since it is generally available for all types of Web pages. We propose a novel content-ignorant feature space, called Web community, which is created from a Japanese Web page archive by exploiting link analysis. Experimental results show that the proposed Web community feature space generates much better results than the URL feature space.
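
As a rough illustration of the idea in this abstract, the sketch below enriches each query with features taken from the pages users accessed and then compares the enriched vectors. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the access log, the URLs, and the URL-to-community mapping are all hypothetical toy data.

```python
import math
from collections import Counter

# Hypothetical toy access log: query -> URLs that users accessed for it.
access_log = {
    "cheap flights": ["travel.example/a", "air.example/b"],
    "airline tickets": ["air.example/b", "air.example/c"],
    "python tutorial": ["code.example/x"],
}

# Hypothetical mapping from URL to a Web community ID (assumed to come
# from link analysis; the abstract does not describe how it is built).
community = {
    "travel.example/a": "travel",
    "air.example/b": "travel",
    "air.example/c": "travel",
    "code.example/x": "programming",
}

def enrich(query, feature_of):
    """Build a feature-count vector for a query from its accessed pages."""
    return Counter(feature_of(url) for url in access_log.get(query, []))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relatedness(q1, q2, feature_of):
    """Relatedness of two queries in the chosen feature space."""
    return cosine(enrich(q1, feature_of), enrich(q2, feature_of))

# URL feature space: only exact URL matches count.
url_score = relatedness("cheap flights", "airline tickets", lambda url: url)
# Community feature space: URLs in the same community count as a match.
community_score = relatedness("cheap flights", "airline tickets", community.get)
```

In this toy data the two flight queries share only one URL but fall entirely in the same community, so the community feature space scores them as more related than the URL feature space does.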

Recently, many new applications, such as sensor data monitoring and mobile device tracking, have raised the issue of uncertain data management. Compared to "certain" data, the data in an uncertain database are not exact points but instead often reside within a region. In this paper, we study ranked queries over uncertain data. Ranked queries have been studied extensively in the traditional database literature due to their popularity in many applications, such as decision making, recommendation raising, and data mining tasks. Many proposals have been made to improve the efficiency of answering ranked queries. However, the existing approaches are all based on the assumption that the underlying data are exact (or certain). Due to the intrinsic differences between uncertain and certain data, these methods are designed only for ranked queries in certain databases and cannot be applied to the uncertain case directly. Motivated by this, we propose novel solutions to speed up the probabilistic ranked query (PRank) with monotonic preference functions over an uncertain database. Specifically, we introduce two effective pruning methods, spatial and probabilistic pruning, to help reduce the PRank search space. A special case of PRank with linear preference functions is also studied. We then seamlessly integrate these pruning heuristics into the PRank query procedure. Furthermore, we propose and tackle PRank query processing over the join of two distinct uncertain databases. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches in answering PRank queries, in terms of both wall clock time and the number of candidates to be refined.

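In search engines, online spelling correction completes a user's query from the input typed so far and offers correctly spelled suggestions. This paper proposes an online spelling correction algorithm for query completion. Based on a noisy-channel transformation of real queries, the algorithm builds a generative model of the user's query input; using spelling-correction pairs, it applies the expectation-maximization algorithm to train a Markov n-gram transformation model that captures users' misspelling behavior; and it uses a heuristically improved A* search with different pruning strategies to return correction and completion suggestions in real time. Experimental results show that the proposed algorithm is more effective than other algorithms of the same kind.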

In complex search tasks, it is often necessary to pose several basic search queries, join the answers to these queries, where each answer is given as a ranked list of items, and return a ranked list of combinations. However, the join result may include too many repetitions of items, and hence the entire join is frequently too large to be useful. This can be solved by choosing a small subset of the join result. The focus of this paper is on how to choose this subset. We propose two measures for estimating the quality of result sets, namely, coverage and optimality ratio. Intuitively, maximizing the coverage aims at including in the result as many appearances as possible of items in their optimal combination, and maximizing the optimality ratio means striving to have each item appear only in its optimal combination, i.e., only in the most highly ranked combination that contains it. One of the difficulties in choosing the subset of the join in a complex search is that there is a conflict between maximizing the coverage and maximizing the optimality ratio. In this paper, we introduce the measures coverage and optimality ratio. We present new semantics for complex search queries, aiming at providing high coverage and a high optimality ratio. We examine the quality of the results of the existing and the novel semantics according to these two measures, and we provide algorithms for answering complex search queries under the new semantics. Finally, we present an experimental study, using Yahoo! Local Search Web Services, of the efficiency and scalability of our algorithms, showing that complex search queries can be evaluated effectively under the proposed semantics.

Peer-to-peer (P2P) technology provides a popular way of distributing, sharing, and locating resources in a large-scale distributed environment. However, most existing P2P systems only support queries over a single resource attribute, such as file name. Current multi-attribute resource search methods often incur high maintenance costs and lack resilience to the highly dynamic environment of P2P networks. In this paper, we propose a Flabellate overlAy Network (FAN), a scalable and structured underlying P2P overlay supporting resource queries over multi-dimensional attributes. In FAN, the resources are mapped into a multi-dimensional Cartesian space based on the consistent hash values of the resource attributes. The mapping space is divided into non-overlapping and continuous subspaces based on the peer's distance. This paper presents strategies for managing the extended adjacent subspaces, which is crucial to network maintenance and resource search in FAN. The algorithms for a basic resource search and a range query over FAN are also presented. To alleviate the load on hot nodes, a virtual replica network (VRN) consisting of the nodes holding the same replicas is proposed for replicating popular resources adaptively. Queries can be forwarded from heavily loaded nodes to lightly loaded ones through the VRN. Theoretical analysis and experimental results show that FAN has higher routing efficiency and lower network maintenance cost than the existing multi-attribute search methods. Also, the VRN efficiently balances the network load and reduces the querying delay in FAN while incurring a relatively low overhead.

The problem of multidimensional file partitioning (MDFP) arises in large databases that are subject to frequent range queries on one or more attributes. In an MDFP scheme, the search attribute space is partitioned into cells, which are mapped to physical disk locations. This mapping preserves the order of the search attribute values so that range queries can be answered most efficiently, while maintaining good performance for other types of queries. Recently, MDFP schemes have been suggested to include both dynamic and static file organizations. Optimal and heuristic MDFP algorithms are developed for the static case. The results of extensive computational experiments show that the proposed heuristics perform better than known static ones. It is also shown that incorporating a static algorithm into a dynamic MDFP scheme such as a grid file, at conversion and/or periodic reorganization points, significantly improves the resulting storage utilization of the data file and decreases the size of the directory file.

Reachability queries play a vital role in many graph analysis tasks. Previous research has proposed many methods to efficiently answer reachability queries between vertex pairs. Since many real graphs are labeled, there is strong demand for Label-Constrained Reachability (LCR) queries, in which the constraint includes a set of labels in addition to the vertex pair. Recent research has proposed several methods for answering LCR queries that require labels specified in the constraint to appear on the path. Besides a label set, the query constraint may be an ordered sequence of labels, giving OLCR (Ordered-Label-Constrained Reachability) queries, which retrieve paths matching a sequence of labels. Currently, no solutions are available for OLCR. Here, we propose DHL, a novel Bloom filter based indexing technique for answering OLCR queries. DHL can be used to check reachability between vertex pairs. If its answer is not negative, a constrained DFS is performed; that is, we employ DHL followed by a constrained DFS to answer OLCR queries. We show that DHL has a bounded false positive rate and that it saves considerable indexing time and space. Extensive experiments on 10 real-life graphs and 12 synthetic graphs demonstrate that DHL achieves about 4.8–22.5 times smaller index space and 4.6–114 times less index construction time than two state-of-the-art techniques for LCR queries, while achieving comparable query response time. The results also show that our algorithm can answer OLCR queries effectively.

Due to the universality and importance of range search query processing in mobile and spatial databases as well as in geographic information systems (GIS), numerous range search algorithms have been proposed in recent years. However, ordinary range search queries focus only on a specific type of point object. For queries that need to retrieve objects of interest located in a particular region, ordinary range search cannot produce the expected results. In addition, most existing range search methods need to search every road segment within the pre-defined range, which degrades the performance of range search. In this paper, we design a weighted network Voronoi diagram and propose a high-performance multilevel range search query processing method that retrieves a set of objects located in some specified region within the searching range. The experimental results show that our proposed algorithm runs very efficiently and outperforms its main competitor.

Optimization and evaluation of shortest path queries
We investigate the problem of how to efficiently evaluate a collection of shortest path queries on massive graphs that are too big to fit in main memory. To evaluate a shortest path query efficiently, we introduce two pruning algorithms. These algorithms differ in the extent to which shortest path costs are materialized and in how the search space is pruned. By grouping shortest path queries properly, batch processing improves the performance of shortest path query evaluation. We also study extensively the fragment sizes, cache sizes, and query types, which we show affect the performance of a disk-based shortest path algorithm. The performance and scalability of the proposed techniques are evaluated with large road systems in the Eastern United States. To demonstrate that the proposed disk-based algorithms are viable, we show that their search times are significantly better than that of the main-memory Dijkstra's algorithm.

Traditional nearest-neighbor (NN) search is based on two basic indexing approaches: object-based indexing and solution-based indexing. The former is constructed based on the locations of data objects, using some distance heuristics on object locations. The latter is built on a precomputed solution space. Thus, NN queries can be reduced to and processed as simple point queries in this solution space. Both approaches exhibit some disadvantages, especially when employed for wireless data broadcast in mobile computing environments. In this paper, we introduce a new index method, called the grid-partition index, to support NN search in both on-demand access and periodic broadcast modes of mobile computing. The grid-partition index is constructed based on the Voronoi diagram, i.e., the solution space of NN queries. However, it has two distinctive characteristics. First, it divides the solution space into grid cells such that a query point can be efficiently mapped into a grid cell around which the nearest object is located. This significantly reduces the search space. Second, the grid-partition index stores the objects that are potential NNs of any query falling within the cell. The storage of objects, instead of the Voronoi cells, makes the grid-partition index a hybrid of the solution-based and object-based approaches. As a result, it achieves a much more compact representation than the pure solution-based approach and avoids backtracked traversals required in the typical object-based approach, thus realizing the advantages of both approaches. We develop an incremental construction algorithm to address the issue of object update. In addition, we present a cost model to approximate the search cost of different grid partitioning schemes. The performances of the grid-partition index and existing indexes are evaluated using both synthetic and real data. The results show that, overall, the grid-partition index significantly outperforms object-based indexes and solution-based indexes. Furthermore, we extend the grid-partition index to support continuous-nearest-neighbor search. Both algorithms and experimental results are presented. Edited by R. Guting
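
The core idea of storing, per grid cell, the objects that could be the nearest neighbor of some query in that cell can be sketched as follows. This is not the paper's construction: instead of intersecting Voronoi cells with grid cells, the sketch uses a conservative distance bound (an assumption made here for brevity), which yields a superset of the true candidates. All function names are hypothetical.

```python
import math

def build_grid_index(objects, size, cells):
    """Map each cell of a cells x cells grid over [0, size]^2 to the
    objects that may be the nearest neighbor of a point in that cell."""
    step = size / cells
    r = step * math.sqrt(2) / 2  # max distance from a cell centre to a point in the cell
    index = {}
    for i in range(cells):
        for j in range(cells):
            centre = ((i + 0.5) * step, (j + 0.5) * step)
            dists = [math.dist(centre, o) for o in objects]
            # If o is the NN of some q in the cell, the triangle inequality
            # gives d(centre, o) <= min(dists) + 2r, so this keeps a superset
            # of the true candidates (small slack added for rounding).
            bound = min(dists) + 2 * r + 1e-9
            index[(i, j)] = [o for o, d in zip(objects, dists) if d <= bound]
    return index, step

def nearest(index, step, cells, q):
    """Answer an NN query by scanning only the candidates of q's cell."""
    i = min(int(q[0] // step), cells - 1)
    j = min(int(q[1] // step), cells - 1)
    return min(index[(i, j)], key=lambda o: math.dist(q, o))
```

A query touches one cell and scans a short candidate list, with no backtracking; a finer grid shortens the lists at the cost of more cells, which is the kind of trade-off the abstract's cost model addresses.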

In the era of big data, the vast majority of the data are not from the surface Web, the Web that is interconnected by hyperlinks and indexed by most general purpose search engines. Instead, the trove of valuable data often resides in the deep Web, the Web that is hidden behind query interfaces. Since numerous applications, like data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by queries are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents will be returned multiple times, while documents ranked low have a small chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, i.e., the query bias and ranking bias problems, and propose a document frequency based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies are within a specified range, to avoid the effect of search ranking plus the return limit and to significantly reduce the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.

We consider the problem of indexing a set of objects moving in d-dimensional spaces along linear trajectories. A simple external-memory indexing scheme is proposed to efficiently answer general range queries. The following are examples of the queries that can be answered by the proposed method: report all moving objects that will (i) pass between two given points within a specified time interval; (ii) come within a given distance of some or all of a given set of other moving objects. Our scheme is based on mapping the objects to a dual space, where queries about moving objects are transformed into polyhedral queries concerning their speeds and initial locations. We then present a simple method for answering such polyhedral queries, based on partitioning the space into disjoint regions and using a B+-tree to index the points in each region. By appropriately selecting the boundaries of each region, we guarantee an average search time that matches a known lower bound for the problem. Specifically, for a fixed d, if the coordinates of a given set of N points are statistically independent, the proposed technique answers polyhedral queries, on average, in $O\big((N/B)^{1-1/d} \cdot (\log_B N)^{1/d} + K/B\big)$ I/Os using $O(N/B)$ space, where B is the block size and K is the number of reported points. Our approach is novel in that, while it provides a theoretical upper bound on the average query time, it avoids the use of complicated data structures, making it an effective candidate for practical applications. The proposed index is also dynamic in the sense that it allows object insertion and deletion at an amortized update cost of $\log_B N$ I/Os. Experimental results are presented to show the superiority of the proposed index over other methods based on R-trees. Recommended by Ahmed Elmagarmid
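
To make the dual-space mapping concrete, here is a minimal one-dimensional sketch. An object moving as x(t) = x0 + v·t becomes the dual point (v, x0), and query (i) above becomes a test against linear constraints on that point. The class and function names are hypothetical, the sketch assumes x1 ≤ x2 and t1 ≤ t2, and it shows only the query predicate, not the region partitioning or the B+-trees of the proposed index.

```python
from typing import NamedTuple

class MovingObject(NamedTuple):
    x0: float  # position at time 0
    v: float   # constant velocity

def in_dual_query_region(o, x1, x2, t1, t2):
    """True if o lies inside [x1, x2] at some time in [t1, t2].

    Over [t1, t2] the object sweeps the interval between x0 + v*t1 and
    x0 + v*t2, so the test reduces to two linear inequalities in the
    dual point (v, x0), one pair for each sign of v.
    """
    if o.v >= 0:
        # Moving right: must not start beyond x2 nor end before x1.
        return o.x0 + o.v * t1 <= x2 and o.x0 + o.v * t2 >= x1
    # Moving left: the roles of t1 and t2 swap.
    return o.x0 + o.v * t2 <= x2 and o.x0 + o.v * t1 >= x1

def range_query(objects, x1, x2, t1, t2):
    """Linear scan baseline; an index would instead search the dual space."""
    return [o for o in objects if in_dual_query_region(o, x1, x2, t1, t2)]
```

For each sign of v the qualifying dual points form a convex polygon bounded by lines in the (v, x0) plane, which is the one-dimensional case of the polyhedral queries the abstract refers to.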

Database systems are becoming increasingly popular for answering queries. Partial-match search queries are an important class of queries in such a system. Several storage structures have been proposed to answer these queries efficiently. The BD tree is an example of such a storage structure. A previous study indicated that the k-d tree performs better than the BD tree for partial-match search queries. A recent paper reported some improved algorithms. However, it is unclear whether the improved algorithms show the BD tree in a favourable light for partial-match search queries. This paper explores the performance of these algorithms and compares it to that of the k-d tree. Since the BD tree construction process uses some heuristics to make it a better balanced tree, this paper also evaluates the effect of these heuristics on the partial-match search algorithms. The major conclusions of this study are that the BD tree performance for partial-match search is better than that of the k-d tree when an improved algorithm is used for partial-match search, and that only the DZ expression rearrangement heuristic has a substantial effect on partial-match search performance.

An important research issue in multimedia databases is the retrieval of similar objects. For most applications in multimedia databases, an exact search is not meaningful. Thus, much effort has been devoted to developing efficient and effective similarity search techniques. A recent approach that has been shown to improve the effectiveness of similarity search in multimedia databases resorts to combinations of metrics (i.e., a search on a multi-metric space). In this approach, the desirable contribution (weight) of each metric is chosen at query time. It follows that standard metric indexes cannot be directly used to improve the efficiency of dynamically weighted queries, because they assume that there is only one fixed distance function at indexing and query time. This paper presents a methodology for adapting metric indexes to multi-metric indexes, that is, to support similarity queries with dynamic combinations of metric functions. The adapted indexes are built with a single distance function and store partial distances to estimate the dynamically weighted distances. We present two novel indexes for multi-metric space indexing, which are the result of applying the proposed methodology.
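
One way stored partial distances can bound a dynamically weighted distance is sketched below. It is not either of the paper's two indexes: it is a plain pivot table, chosen here as an assumption for illustration, with hypothetical names throughout. For each metric d_i and pivot p, the triangle inequality gives |d_i(q, p) − d_i(o, p)| ≤ d_i(q, o), so with non-negative weights the weighted sum of these differences is a lower bound on the weighted distance, whatever weights are chosen at query time.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def weighted_distance(metrics, weights, a, b):
    """Linear combination of metrics with query-time weights."""
    return sum(w * d(a, b) for w, d in zip(weights, metrics))

class PivotTable:
    def __init__(self, metrics, pivot, objects):
        self.metrics = metrics
        self.pivot = pivot
        self.objects = objects
        # Partial distances: one per metric, computed once at indexing time.
        self.partials = [[d(o, pivot) for d in metrics] for o in objects]

    def range_query(self, q, weights, radius):
        """Return objects within radius of q and the number of full
        weighted-distance evaluations that were needed."""
        q_partials = [d(q, self.pivot) for d in self.metrics]
        result, computed = [], 0
        for o, part in zip(self.objects, self.partials):
            lower = sum(w * abs(qp - op)
                        for w, qp, op in zip(weights, q_partials, part))
            if lower > radius:
                continue  # pruned without evaluating the full distance
            computed += 1
            if weighted_distance(self.metrics, weights, q, o) <= radius:
                result.append(o)
        return result, computed
```

The table is built once, yet the same stored partial distances serve any non-negative weight vector supplied with the query, which is the property a fixed-distance metric index lacks.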

Uncertainty is inherent in many real-world applications. In this paper, we investigate an important query in uncertain databases, namely the probabilistic least influenced set (PLIS) query, which retrieves all the uncertain objects in an uncertain database that are, with high probability, the least affected by a given query object. Such a PLIS query is useful in applications such as business planning. We propose and tackle both monochromatic and bichromatic versions (i.e., M-PLIS and B-PLIS, respectively) of the PLIS query. In order to answer PLIS queries efficiently, we present three pruning methods, MINMAX, Regional, and Candidate pruning, which can effectively reduce the PLIS search space. The proposed pruning methods can be seamlessly integrated into efficient query procedures. Moreover, we also study an important variant of the PLIS query with an uncertain query object (i.e., UQ-PLIS). Furthermore, we formulate and tackle the PLIS problem on uncertain moving objects (i.e., UMOD-PLIS). Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches under various settings.

Content-based cross-media retrieval is a new type of multimedia retrieval in which the media types of query examples and the returned results can be different. In order to learn the semantic correlations among multimedia objects of different modalities, the heterogeneous multimedia objects are analyzed in the form of a multimedia document (MMD), which is a set of multimedia objects that are of different media types but carry the same semantics. We first construct an MMD semi-semantic graph (MMDSSG) by jointly analyzing the heterogeneous multimedia data. After that, a cross-media indexing space (CMIS) is constructed. For each query, the optimal dimension of the CMIS is automatically determined and the cross-media retrieval is performed on a per-query basis. By doing this, the most appropriate retrieval approach for each query is selected, i.e., different search methods are used for different queries. The query-dependent search methods make cross-media retrieval performance not only accurate but also stable. We also propose different learning methods of relevance feedback (RF) to improve the performance. The experimental results are encouraging and validate the proposed methods.

In this paper, we propose an efficient solution for processing continuous range spatial keyword queries over moving spatio-textual objects (namely, CRSK-mo queries). The major challenges in efficiently processing CRSK-mo queries are as follows: (i) the query range is determined by both spatial proximity and textual similarity, so straightforward pruning of the search space based on spatial proximity is not applicable, as an object far from the query location but with a high textual similarity score can still be an answer (and vice versa); (ii) frequent location updates may invalidate a query result and thus require frequent re-computation of the result set whenever an object is updated. To address these challenges, the key idea of our approach is to exploit the spatial and textual upper bounds between queries and objects to form safe zones (at the client side) and buffer regions (at the server side), and then use these bounds to quickly prune objects and queries through smart in-memory data structures. We conduct extensive experiments with a synthetic dataset that verify the effectiveness and efficiency of our proposed algorithm.

Reverse nearest neighbors in large graphs
A reverse nearest neighbor (RNN) query returns the data objects that have a query point as their nearest neighbor (NN). Although such queries have been studied quite extensively in Euclidean spaces, there is no previous work in the context of large graphs. In this paper, we provide a fundamental lemma, which can be used to prune the search space while traversing the graph in search of RNNs. Based on it, we develop two RNN methods: an eager algorithm that attempts to prune network nodes as soon as they are visited, and a lazy technique that prunes the search space when a data point is discovered. We study retrieval of an arbitrary number k of reverse nearest neighbors, investigate the benefits of materialization, cover several query types, and deal with cases where the queries and the data objects reside on nodes or edges of the graph. The proposed techniques are evaluated in various practical scenarios involving spatial maps, computer networks, and the DBLP coauthorship graph.
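
The definition of an RNN query on a graph can be made concrete with a brute-force baseline. This is deliberately not the eager or lazy algorithm from the abstract, whose pruning lemma is not stated here: it simply runs Dijkstra's algorithm from every data point and checks whether the query node is at least as close as any other data point. The graph representation (an adjacency dict of weighted undirected edges), the restriction to data points on nodes, and the choice to count ties as RNNs are assumptions of this sketch.

```python
import heapq

INF = float("inf")

def dijkstra(graph, source):
    """Shortest-path distances from source; graph[u] = [(v, weight), ...]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reverse_nearest_neighbors(graph, data_points, q):
    """Data points whose nearest neighbor (among q and the other data
    points) is q. One full Dijkstra run per data point: a baseline only."""
    result = []
    for p in data_points:
        dist = dijkstra(graph, p)
        d_q = dist.get(q, INF)
        d_other = min((dist.get(o, INF) for o in data_points if o != p),
                      default=INF)
        if d_q < INF and d_q <= d_other:
            result.append(p)
    return result

# Hypothetical path graph: a -2- b -1- q -2- c -1- d
graph = {
    "a": [("b", 2)],
    "b": [("a", 2), ("q", 1)],
    "q": [("b", 1), ("c", 2)],
    "c": [("q", 2), ("d", 1)],
    "d": [("c", 1)],
}
rnn = reverse_nearest_neighbors(graph, ["a", "c", "d"], "q")
```

Here only "a" has q as its nearest neighbor, since "c" and "d" are closer to each other than to q. The cost of one graph traversal per data point is what makes pruning during a single traversal from q attractive.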

Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.
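
For readers unfamiliar with the underlying data structure, the following is a minimal sketch of the sequential (non-distributed) substring search that the abstract takes as its starting point. The construction is the naive sort of all suffixes, used here only for brevity, and the function names are hypothetical; the paper's distribution strategies are not shown.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.
    Naive construction: fine for short texts, too slow for large ones."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ana")
```

Each query costs O(m log n) character comparisons on a text of length n, which is why a single search is cheap and the harder problem, as the abstract notes, is keeping many processors busy with many concurrent queries.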
