Similar literature
20 similar documents found (search time: 203 ms)
1.
Keyword search is the most popular technique for searching information in XML (eXtensible Markup Language) documents. It enables users to access XML data easily without learning a structured query language or studying complex data schemas. Existing keyword query methods are mainly based on LCA (lowest common ancestor) semantics, in which the returned results match all keywords at the granularity of elements. In many practical applications, however, information is uncertain and vague, so identifying useful information in fuzzy data is becoming an important research topic. In this paper, we focus on keyword querying over fuzzy XML data at the granularity of objects. By introducing the concept of an “object tree”, we propose query semantics for keyword queries at the object level. We find the minimum whole-matching result object trees, which contain all keywords, and the partial-matching result object trees, which contain only some of the keywords, and return the root nodes of these result object trees as query results. To identify the top-K answers with the highest scores effectively and accurately, we propose a scoring mechanism that considers tf*idf document relevance, users’ preferences, and the possibilities of results. We propose a stack-based algorithm named object-stack to obtain the top-K answers with the highest scores. Experimental results show that the object-stack algorithm significantly outperforms traditional XML keyword query algorithms and obtains high-quality query results with high search efficiency on fuzzy XML documents.

2.
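Existing XML keyword query algorithms usually consider only the structural information between nodes and return subtrees containing the keyword-matching nodes as query results, while the semantic relevance between nodes has never been fully exploited. This is the main reason why the results of existing query algorithms commonly contain a large amount of semantically irrelevant, redundant information. In this paper, we first define the contextual semantics of query keywords and the semantic relevance between nodes. On this basis, we propose a new keyword query algorithm that finds semantically related units as the results of a keyword query. Results obtained this way contain no semantically irrelevant redundant information and also match the user's query intent more closely. Experiments show that the proposed algorithm substantially improves both query efficiency and precision.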

3.
Keyword query processing over graph-structured data is beneficial across various real-world applications. The basic unit of search and retrieval in keyword search over graphs is a structure (an interconnection of nodes) that connects all the query keywords. This answering paradigm, in contrast to the single-web-page results returned by search engines, brings new challenges for ranking. In this paper, we propose a simple but effective ranking measure based on fuzzy set theory, called FRank. Fuzzy sets acknowledge the contribution of each individual query keyword, discretely, to enumerate node relevance. A novel aggregation operator is defined to combine the content-relevance-based fuzzy sets and compute query-dependent edge weights. The final rank of an answer is computed by non-monotonic addition of edge weights according to their relevance to the keyword query. FRank evaluates each answer based on the distribution of query keywords and the structural connectivity between those keywords. An extensive empirical analysis shows that the proposed ranking measure outperforms the ranking measures adopted by current approaches in the literature.

4.
黎玲利  王宏志  高宏  李建中 《软件学报》2012,23(6):1561-1577
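Keywords make it possible to query XML data when the schema is unknown. In current keyword query processing over XML data streams, scoring functions often cannot satisfy the differing needs of all users. This paper proposes a skyline-based top-K keyword query over XML data streams. For this kind of query, there is no need to consider the complex factors that affect the relevance between results and the query; the skyline alone is used to select the results most relevant to the query. Two effective skyline-based top-K keyword query processing algorithms over XML data streams are proposed, covering single-query and multi-query processing. Extensive experiments verify the effectiveness and scalability of both algorithms. The experiments show that the efficiency of the proposed algorithms is almost unaffected by parameters such as the number of keywords, the number of query results, and the number of queries, and that running time is roughly linear in document size.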

5.
Existing work on XML keyword search focuses on how to find relevant and meaningful data fragments for a query, assuming that each keyword is intended as part of it. However, in XML keyword search, user queries often contain irrelevant or mismatched terms, typos, etc., which may easily lead to empty or meaningless results. In this paper, we introduce the problem of content-aware XML keyword query refinement, in which the search engine should judiciously decide whether a user query Q needs to be refined during the processing of Q, and find a list of promising refined query candidates that are guaranteed to have meaningful matching results over the XML data, without any user interaction or a second try. To achieve this goal, we build a novel content-aware XML keyword query refinement framework consisting of two core parts: (1) we build a query ranking model to evaluate the quality of a refined query RQ, which captures the morphological/semantic similarity between Q and RQ and the dependency of the keywords of RQ over the XML data; (2) we integrate the exploration of RQ candidates and the generation of their matching results into a single problem, which is solved optimally within a one-time scan of the related keyword inverted lists. Finally, an extensive empirical study verifies the efficiency and effectiveness of our framework.

6.
Keyword search in XML documents has recently gained a lot of research attention. Given a keyword query, existing approaches first compute the lowest common ancestors (LCAs), or their variants, of the XML elements that contain the input keywords, and then identify the subtrees rooted at the LCAs as the answer. In this paper, we study how to use the rich structural relationships embedded in XML documents to facilitate the processing of keyword queries. We develop a novel method, called SAIL, to index such structural relationships for efficient XML keyword search. We propose the concept of minimal-cost trees to answer keyword queries and devise structure-aware indices that maintain the structural relationships for efficiently identifying the minimal-cost trees. To identify the top-k answers effectively and progressively, we develop techniques using link-based relevance ranking and keyword-pair-based ranking. To reduce the index size, we incorporate a numbering scheme, namely the schema-aware Dewey code, into our structure-aware indices. Experimental results on real data sets show that our method significantly outperforms state-of-the-art approaches in both answer quality and search efficiency.
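The LCA step that this abstract attributes to existing approaches can be illustrated with a minimal sketch, assuming nodes carry plain Dewey labels (the sequence of child positions on the path from the root). The function name `dewey_lca` and the sample labels are hypothetical; this is not the schema-aware Dewey code or the SAIL index of the paper.

```python
from typing import Sequence, Tuple

Dewey = Tuple[int, ...]  # e.g. (1, 2, 3) = third child of second child of the root


def dewey_lca(labels: Sequence[Dewey]) -> Dewey:
    """Return the Dewey label of the lowest common ancestor of the given nodes.

    With Dewey labels, the LCA is simply the longest common prefix.
    """
    if not labels:
        raise ValueError("need at least one label")
    prefix = labels[0]
    for label in labels[1:]:
        # Count how many leading components the two labels share.
        shared = 0
        for a, b in zip(prefix, label):
            if a != b:
                break
            shared += 1
        prefix = prefix[:shared]
    return prefix


# Hypothetical matches for two keywords inside the same subtree.
title_node = (1, 2, 1)
author_node = (1, 2, 3, 1)
print(dewey_lca([title_node, author_node]))  # (1, 2)
```

When one matching node is an ancestor of another, the longest common prefix is the ancestor's own label, which agrees with the usual LCA definition.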

7.
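Spatial keyword search aims to find points of interest (POIs) that satisfy the user's query intent and are spatially close, and it is widely used in areas such as map search. Traditional spatial keyword search methods consider only the textual match between keywords and POIs and ignore the semantics of the query, which leads to missing relevant results and introducing irrelevant ones. To address these limitations, this paper proposes S3 (semantic-enhanced spatial keyword search), a semantically enhanced spatial keyword search method. S3 analyzes the semantic information contained in the query keywords and ranks POIs effectively by combining semantic relevance and spatial distance. S3 faces two main technical challenges: 1) How to analyze semantic information. To this end, S3 introduces a knowledge base to semantically enrich the POI data and proposes a graph-based semantic distance measure; combining semantic distance and spatial distance, S3 gives a comprehensive ranking scheme for POIs. 2) How to return top-k search results instantly over large-scale data. To meet this challenge, a new hybrid semantic-spatial index structure, GRTree (graph rectangle tree), is proposed, and effective pruning strategies are studied. Experiments on large-scale real data sets show that S3 not only returns more relevant results but also has good efficiency and scalability.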

8.
李婷  程海涛 《计算机科学》2017,44(9):216-221, 226
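Most research on keyword query methods over precise XML documents is based on LCA semantics or its variants (SLCA, ELCA, etc.), returning the most compact XML subtree fragments that contain all keywords as query results. However, the results produced under LCA-based semantics usually contain a large amount of redundant information. Moreover, the real world contains a great deal of uncertain and fuzzy information, so how to retrieve high-quality keyword query results from fuzzy XML documents is a problem that needs study. This paper studies approximate keyword query methods over fuzzy XML documents. By introducing the concept of the minimum connecting tree (MCT), it formulates the all-GDMCTs problem for keyword queries over fuzzy XML documents and gives a stack-based algorithm, All fuzzy GDMCTs, to solve it. The algorithm obtains all GDMCT results that satisfy the user-specified subtree size threshold and possibility threshold. Experiments show that the algorithm obtains relatively high-quality keyword query results on fuzzy XML documents.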

9.
Keyword search enables inexperienced users to search XML databases easily, with no specific knowledge of complex structured query languages or XML data schemas. Existing work has addressed the problem of selecting data nodes that match keywords and connecting them in a meaningful way, e.g., SLCA and ELCA. However, it is time-consuming and unnecessary to serve all the connected subtrees to users, because in general users are interested in only part of the relevant results. In this paper, we propose a new keyword search approach that uses the statistics of the underlying XML data to decide the promising result types and then quickly retrieves the corresponding results with the help of the selected promising result types. To guarantee the quality of the selected promising result types, we measure the correlations between result types and a keyword query by analyzing the distribution of relevant keywords and their structures within the XML data to be searched. In addition, relevant result types can be computed efficiently without keyword query evaluation or any schema information. To directly return top-k keyword search results that conform to the suggested promising result types, we design two new algorithms that adapt to the structural sensitivity of the keyword nodes over the keyword search results. Lastly, we implement all proposed approaches and present experimental results that show the effectiveness of our approach.

10.
Keyword proximity search in XML trees (cited 3 times: 0 self-citations, 3 citations by others)
Recent work has shown the benefits of keyword proximity search in querying XML documents in addition to text documents. For example, given query keywords over Shakespeare's plays in XML, the user might be interested in knowing how the keywords co-occur. In this paper, we focus on XML trees and define XML keyword proximity queries to return the (possibly heterogeneous) set of minimum connecting trees (MCTs) of the matches to the individual keywords in the query. We consider efficiently executing keyword proximity queries on labeled trees (XML) in two settings: 1) when the XML database has been preprocessed and 2) when no indices are available on the XML database. We perform a detailed experimental evaluation to study the benefits of our approach and show that our algorithms considerably outperform prior algorithms and other applicable approaches.

11.
Extensible Markup Language (XML) is commonly employed to represent and transmit information over the Internet. Therefore, how to perform keyword search effectively over massive XML data has become a new issue. In this paper, we first present four properties to improve the classical ILE algorithm. Then, a parallel XML keyword search algorithm that computes SLCAs based on intelligent grouping is proposed and implemented under the MapReduce programming model. Finally, a series of experiments is conducted on 7 datasets of different sizes. The results indicate that the proposed algorithm has high execution efficiency and is applicable to keyword search over massive XML data.
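For readers unfamiliar with SLCA (smallest lowest common ancestor) semantics, the following is a deliberately naive sketch, assuming each keyword comes with a list of matching nodes identified by plain Dewey labels. It enumerates every combination of matches, so it is exponential in the number of keywords; it is neither the ILE algorithm nor the grouped MapReduce algorithm of the abstract. The names `common_prefix` and `naive_slca` and the sample labels are hypothetical.

```python
from itertools import product
from typing import List, Set, Tuple

Dewey = Tuple[int, ...]


def common_prefix(a: Dewey, b: Dewey) -> Dewey:
    """Longest common prefix of two Dewey labels, i.e. their LCA."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return a[:shared]


def is_proper_ancestor(a: Dewey, d: Dewey) -> bool:
    """True if node a is a strict ancestor of node d."""
    return len(a) < len(d) and d[:len(a)] == a


def naive_slca(matches: List[List[Dewey]]) -> Set[Dewey]:
    """Return the SLCA nodes for one list of matching nodes per keyword."""
    if not matches:
        return set()
    lcas: Set[Dewey] = set()
    # Take one match per keyword and compute the LCA of that combination.
    for combo in product(*matches):
        lca = combo[0]
        for label in combo[1:]:
            lca = common_prefix(lca, label)
        lcas.add(lca)
    # An SLCA is an LCA that has no other LCA strictly below it.
    return {a for a in lcas if not any(is_proper_ancestor(a, d) for d in lcas)}


# Hypothetical inverted lists for the keywords "xml" and "search".
xml_nodes = [(1, 1, 1), (1, 2, 1)]
search_nodes = [(1, 1, 2), (1, 3)]
print(naive_slca([xml_nodes, search_nodes]))  # {(1, 1)}
```

In the example, the root `(1,)` is also an LCA of some combinations, but it is discarded because the LCA `(1, 1)` lies below it.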

12.
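An important reason for the low quality of XML keyword query results is that query keywords hardly reflect the user's real query intent; assigning weights to keywords can address this difficulty to some extent. Taking into account the structural relationships between keywords, this paper proposes a new result ranking method. The method assigns weights to query keywords and, with reference to these weights, assigns node weights to the nodes containing the keywords. It then computes the importance of SLCA nodes from the node weights in the relation tree and the structural relationships between keywords [1], ranks the query results accordingly, and finally returns the ranked results to the user. Experimental results and analysis show that the proposed ranking method achieves high precision and satisfies users' query needs and preferences well.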

13.
XML structural query expansion based on weighted query terms (cited 9 times: 0 self-citations, 9 citations by others)
万常选  鲁远 《软件学报》2008,19(10):2611-2619
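A main reason for poor retrieval quality in text document information retrieval is that users find it difficult to formulate query expressions that accurately describe their query intent. XML documents have structural features in addition to the content features of text documents, which makes it even harder for users to formulate accurate query expressions. To solve this problem, a query expansion method based on relevance feedback is proposed, which helps users construct "content and structure" query expressions that satisfy their query intent. The method first performs query term expansion to find the weighted expansion terms that best represent the user's query intent; it then performs structural query expansion on the basis of the expanded terms; finally, it forms a complete "content and structure" expanded query expression. Experimental results show that, compared with no query expansion, the average precision at prec@10 and prec@20 improves by more than 30% after expansion.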

14.
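When using keyword queries, users may fail to express their intent accurately, and even when they express it correctly, the query engine may fail to return accurate results. To address this problem, this paper studies how to perform effective query rewriting and generate meaningful results in XML keyword queries. It proposes four query rewriting operations and the concept of query rewriting cost, and gives a dynamic programming method for computing the rewriting cost. To find the optimal query rewriting, it gives a stack-based algorithm for query rewriting and result generation, and proposes a partition-based optimization algorithm. Finally, the proposed methods are validated through extensive experiments.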

15.
Searching XML data using keyword queries has attracted much attention because it enables Web users to access XML data easily without having to learn a structured query language or study possibly complex data schemas. Most current approaches identify the meaningful results of a given keyword query based on the semantics of the lowest common ancestor (LCA) and its variants. However, because LCA candidates are usually numerous and of low relevance to the users' information needs, how to identify the most relevant results from a large number of LCA candidates effectively and efficiently is still a challenging and unresolved issue. In this article, we introduce a novel semantics of relevant results based on mutual information between the query keywords. Then, we introduce a novel approach for identifying the relevant answers of a given query by adopting skyline semantics. We also recommend three different ranking criteria for selecting the top-k relevant results of the query. Efficient algorithms are proposed that rely on provable properties of the dominance relationship between result candidates to rapidly identify the top-k dominant results. Extensive experiments were conducted to evaluate our approach, and the results show that it performs well compared with other existing approaches across different data sets and evaluation metrics.
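The dominance relationship behind skyline semantics can be sketched as follows, assuming each result candidate is described by a tuple of scores in which larger is better on every dimension. The function names and the sample scores are hypothetical, and the quadratic scan below is only a baseline illustration, not the algorithms proposed in the article.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is at least as good on every dimension
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the candidates that no other candidate dominates (O(n^2) scan)."""
    return [p for p in candidates if not any(dominates(q, p) for q in candidates)]


# Hypothetical (relevance score, compactness score) pairs for four candidates.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
print(skyline(candidates))  # [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9)]
```

The candidate `(0.4, 0.4)` is dropped because `(0.5, 0.5)` is better on both dimensions; the other three are incomparable, so all of them remain in the skyline.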

16.
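XML keyword query is a convenient information search method for users and is well suited to users who are unfamiliar with XML query languages and the underlying structure. Existing keyword queries over XML data streams mostly find the SLCA result set. To overcome the incompleteness of the SLCA-based result set definition, an XLCA-based result set definition is introduced so that queries include results as completely as possible. This paper proposes using a sliding window model to store data for XML data streams and, based on the XLCA result set definition, proposes a top-K keyword query algorithm. It proves the correctness of the algorithm and the completeness of the query theoretically, and analyzes its time and space complexity.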

17.
Keyword search enables web users to access XML data easily without understanding complex data schemas. However, the inherent ambiguity of keyword search makes it hard to select qualified, relevant results that match the keywords. To solve this problem, researchers have put much effort into establishing ranking models that distinguish relevant from irrelevant passages, such as the highly cited TF*IDF and BM25. However, these statistics-based ranking methods mostly consider term frequency, inverse document frequency, and length as ranking factors, ignoring the distribution of and connections between different keywords. Hence, these widely used ranking methods are unable to recognize irrelevant results that have high term frequency, which limits their performance. In this paper, a new search system, XDist, is proposed to address these problems. In XDist, we first use the semantic query model maximal lowest common ancestor (MAXLCA) to identify the returned results of a given query, and these candidate results are then ranked by BM25. In particular, XDist re-ranks the top several results by a combined distribution measurement (CDM) that considers four criteria: term proximity, intersection of keyword classes, degree of integration among keywords, and quantity variance of keywords. The weights of the four measures in CDM are trained by a listwise learning-to-optimize method. Experimental results on the INEX evaluation platform show that the CDM re-ranking method improves the performance of the baseline BM25 by 22% under iP[0.01] and by 18% under MAiP. The semantic model MAXLCA and the search engine XDist also perform best in their respective fields.
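The BM25 baseline mentioned in this abstract depends only on term frequency, inverse document frequency, and length, which the sketch below makes concrete. It uses the common formulation with the smoothed, non-negative IDF term; the parameter defaults `k1=1.2` and `b=0.75` are conventional values assumed here, not values reported in the paper, and the function name and sample documents are hypothetical. It does not implement MAXLCA or the CDM re-ranking.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized, non-empty document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical tokenized fragments (e.g. the text of candidate subtrees).
docs = [["xml", "keyword", "search"], ["xml", "xml", "index"], ["graph", "ranking"]]
print(bm25_scores(["keyword", "search"], docs))
```

Because the score never looks at where the keywords occur or how they are connected, two fragments with the same term counts and length receive the same score, which is the limitation the abstract's distribution-based re-ranking is meant to address.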

18.
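To further improve the speed and accuracy of web page relevance judgment, a new sentence weighting method for focused summarization is proposed. Based on the result set returned by a query, phrases in the input query are identified by computing the mutual information between keywords; the tag information in the web page text is used to analyze the page structure; and information such as keyword phrases and page structure is incorporated into the sentence weight computation. Experimental results show that query summaries generated by this algorithm outperform existing methods in both the speed and accuracy of relevance judgment.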

19.
梁银  董永权 《计算机应用》2014,34(7):1992-1996
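In spatial keyword queries, it is sometimes necessary to find a group of objects that is compact and closest to the query point, covers the query keywords, and contains few objects, whereas existing query methods can usually return only a single spatial object containing all the query keywords. To this end, an approximate query algorithm and an exact query algorithm are proposed for this kind of query. First, a formal definition of the query problem is given together with a cost function describing the quality of an object set, and the cost function is normalized. Then the approximate algorithm prunes with a best-first search strategy based on the IR-tree, which effectively reduces the candidate search space, while the exact algorithm uses a breadth-first search strategy based on the IR-tree to find objects containing the query keywords, in order to reduce query processing cost. Experimental results show that the approximate algorithm is clearly more efficient than the exact algorithm while still producing highly accurate results.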

20.
As a large number of corpora are represented, stored, and published in XML format, how to find useful information in XML databases has become an increasingly important issue. Keyword search enables web users to access XML data easily without the need to learn a structured query language or to study complex data schemas. Most existing indexing strategies for XML keyword search are based on Dewey encoding. In this paper, we propose a new encoding method for XML documents called Level Order and Father (LAF). With LAF encoding, we devise a new index structure, called the two-layer LAF inverted index, which greatly decreases space complexity compared with a Dewey encoding-based inverted index. Furthermore, with the two-layer LAF inverted index, we propose a new keyword query algorithm called Algorithm based on Binary Search (ABS) that can quickly find all Smallest Lowest Common Ancestors (SLCAs). We experimentally evaluate the two-layer LAF inverted index and the ABS algorithm on four real XML data sets selected from Wikipedia. The experimental results demonstrate the advantages of our index method and querying algorithm. The space consumed by the two-layer LAF index is less than half of that consumed by the Dewey inverted index. Moreover, ABS is about one to two orders of magnitude faster than the classic Stack algorithm. Concurrency and Computation: Practice and Experience, 2012. © 2012 Wiley Periodicals, Inc.
