Similar literature
 20 similar articles found (search time: 93 ms)
1.
With the Internet growing exponentially, search engines are encountering unprecedented challenges. A focused search engine selectively seeks out web pages that are relevant to user topics. Determining the best strategy for focused search is a crucial and popular research topic. At present, the rank values of unvisited web pages are computed from hyperlinks (as in the PageRank algorithm), from a Vector Space Model, or from a combination of the two, and not from the semantic relations between the user topic and the unvisited web pages. In this paper, we propose a concept context graph that stores the knowledge context derived from the user's history of clicked web pages and guides a focused crawler in its next crawl. The concept context graph provides a novel semantic ranking that guides the web crawler toward highly relevant web pages on the user's topic. By computing the concept distance and concept similarity among the concepts of the concept context graph and by matching unvisited web pages against the graph, we compute the rank values of the unvisited web pages and pick out the relevant hyperlinks. Additionally, we build the focused crawling system and measure the precision, recall, average harvest rate, and F-measure of our proposed approach against Breadth First, Cosine Similarity, the Link Context Graph and the Relevancy Context Graph. The results show that our proposed method outperforms the other methods.
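
The evaluation measures named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `crawl_metrics` is hypothetical, and the average harvest rate is assumed here to be the mean, over crawl steps, of the fraction of pages fetched so far that are relevant (definitions vary between papers).

```python
def crawl_metrics(crawled, relevant):
    """Compute crawl quality measures.

    crawled:  page ids in the order the crawler fetched them (assumed unique).
    relevant: ids of all pages judged relevant to the topic.
    """
    crawled = list(crawled)
    relevant = set(relevant)
    hits = [page in relevant for page in crawled]
    n_hits = sum(hits)

    precision = n_hits / len(crawled) if crawled else 0.0
    recall = n_hits / len(relevant) if relevant else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # Assumed definition: harvest rate after k fetches = relevant fetched / k,
    # averaged over all k.
    running, found = [], 0
    for step, hit in enumerate(hits, start=1):
        found += hit
        running.append(found / step)
    avg_harvest = sum(running) / len(running) if running else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "avg_harvest_rate": avg_harvest,
    }


metrics = crawl_metrics(["a", "b", "c", "d"], {"a", "c", "e"})
```

Averaging the running harvest rate rewards crawlers that find relevant pages early, which plain precision over the final crawl does not capture.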

2.
Keyword query processing over graph-structured data is beneficial across various real-world applications. The basic unit of search and retrieval in keyword search over graphs is a structure (an interconnection of nodes) that connects all the query keywords. This new answering paradigm, in contrast to the single web page results given by search engines, brings new challenges for ranking. In this paper, we propose a simple but effective ranking measure based on fuzzy set theory, called FRank. Fuzzy sets account for the contribution of each individual query keyword separately when computing node relevance. A novel aggregation operator is defined to combine the content-relevance-based fuzzy sets and compute query-dependent edge weights. The final rank of an answer is computed by non-monotonic addition of edge weights according to their relevance to the keyword query. FRank evaluates each answer based on the distribution of query keywords and the structural connectivity between those keywords. An extensive empirical analysis shows that the proposed ranking measure outperforms the ranking measures adopted by current approaches in the literature.

3.
Graph representations of data are increasingly common. Such representations arise in a variety of applications, including computational biology, social network analysis, web applications, and many others. There has been much work in recent years on developing learning algorithms for such graph data; in particular, graph learning algorithms have been developed for both classification and regression on graphs. Here we consider graph learning problems in which the goal is not to predict labels of objects in a graph, but rather to rank the objects relative to one another; for example, one may want to rank genes in a biological network by relevance to a disease, or customers in a social network by their likelihood of being interested in a certain product. We develop algorithms for such problems of learning to rank on graphs. Our algorithms build on the graph regularization ideas developed in the context of other graph learning problems, and learn a ranking function in a reproducing kernel Hilbert space (RKHS) derived from the graph. This allows us to show attractive stability and generalization properties. Experiments on several graph ranking tasks in computational biology and in cheminformatics demonstrate the benefits of our framework.

4.
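A graph-based feature-term weighting algorithm and its application to document ranking   (total citations: 1; self-citations: 0; citations by others: 1)
The core tasks of information retrieval include classifying and ranking documents, and measuring the weights of the feature terms in a document effectively is a key technique for both. A text graph is built for each document from relations such as term co-occurrence. Based on the idea that adjacent terms influence one another's importance, and combined with the term frequencies of the feature terms in the document, the weight of each term is computed iteratively. Global properties of the text graph, such as its density, are then used to rank retrieval results. Experiments confirm that the algorithm performs well on standard datasets.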
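
The iterative weighting idea in this abstract can be illustrated with a TextRank-style sketch. This is not the authors' algorithm, whose update rule the abstract does not give: the names `cooccurrence_graph` and `term_weights`, the window size, and the damping factor are assumptions, and term frequency is mixed in here as the restart distribution.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link every pair of distinct terms that occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def term_weights(tokens, window=2, damping=0.85, iterations=50):
    """Iteratively propagate importance between neighbouring terms."""
    graph = cooccurrence_graph(tokens, window)
    if not graph:
        return {}
    tf = Counter(tokens)
    total = len(tokens)
    weights = {word: 1.0 / len(graph) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word in graph:
            # Each neighbour shares its current weight equally among its links.
            received = sum(weights[n] / len(graph[n]) for n in graph[word])
            # Assumed mixing rule: term frequency acts as the restart term.
            updated[word] = (1 - damping) * tf[word] / total + damping * received
        weights = updated
    return weights


tokens = "graph ranking uses graph weights and graph density".split()
weights = term_weights(tokens)
top_term = max(weights, key=weights.get)
```

Using term frequency as the restart term is one simple way to combine the graph structure with the frequency signal the abstract mentions; other combinations are equally plausible.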

5.
Most entity ranking research aims to retrieve a ranked list of entities from a Web corpus given a user query. The rank order of entities is determined by the relevance between the query and the contexts of the entities. However, entities can also be ranked directly by their relative importance in a document collection, independent of any query. In this paper, we introduce an entity ranking algorithm named NERank+. Given a document collection, NERank+ first constructs a graph model called the Topical Tripartite Graph, consisting of document, topic and entity nodes. We design separate ranking functions to compute the prior ranks of entities and topics, respectively. A meta-path constrained random walk algorithm is proposed to propagate prior entity and topic ranks based on the graph model. We evaluate NERank+ over real-life datasets and compare it with baselines. Experimental results illustrate the effectiveness of our approach.

6.
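陈伟柱, 陈英, 吴燕. 《计算机应用》 (Journal of Computer Applications), 2005, 25(5): 995-997, 1003
A new search-engine ranking algorithm based on classification techniques, CategoryRank, is proposed. Using category information, the algorithm computes the ranking scores of web pages more accurately and improves the accuracy of search-engine ranking. It analyzes the link graph based on the category information between any two web pages and, compared with algorithms such as PageRank, models users' browsing habits more accurately. For every page on the Web, the algorithm also computes category attributes, which directly reflect the page's importance to different users. Finally, the offline and online models of the algorithm are unified, and its operating mechanism in search-engine ranking is explained.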

7.
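Predicting microRNA with a manifold ranking algorithm*   (total citations: 1; self-citations: 0; citations by others: 1)
For the case where few microRNAs (miRNAs) are known, a miRNA prediction algorithm based on manifold ranking is proposed to improve prediction accuracy. The algorithm describes sequences with a weighted graph model and assigns ranking scores by belief propagation, which lowers its time complexity; it ranks according to the global manifold structure within large-scale data, which improves the accuracy of the ranking results. Genome-wide experiments on human and Anopheles show that manifold ranking predicts better than traditional prediction methods and can serve as an effective tool for miRNA prediction.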

8.
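Because web pages vary widely in quality, quality ranking of web pages based on the web link graph has become an important component of modern search engines. This paper analyzes why large-scale sparse matrix-vector multiplication is inefficient when the implementation of a web ranking module is optimized and, taking the actual characteristics of the web link graph into account, proposes several optimization strategies. The strategies are then compared experimentally and, weighing their computational efficiency against their storage requirements, the one suited to a practical system is selected. A workaround for implementing the PageRank algorithm, sink removal (除汇), is also proposed.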
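
The abstract names sink removal but does not describe the procedure. One plausible reading, sketched here purely as an assumption, is to delete pages with no out-links before running PageRank and to repeat until none remain, since each deletion can create new sinks. The function name `remove_sinks` is hypothetical.

```python
def remove_sinks(links):
    """Repeatedly delete nodes without out-links from a link graph.

    links: dict mapping each page to the set of pages it links to.
    Returns a new dict in which every remaining page has at least one out-link.
    """
    # Copy the graph and make sure every link target is also a key.
    graph = {page: set(targets) for page, targets in links.items()}
    for targets in links.values():
        for target in targets:
            graph.setdefault(target, set())

    while True:
        sinks = {page for page, targets in graph.items() if not targets}
        if not sinks:
            return graph
        # Dropping sinks may leave their predecessors without out-links,
        # so the loop runs until the graph is stable.
        graph = {
            page: targets - sinks
            for page, targets in graph.items()
            if page not in sinks
        }


web = {"a": {"b", "c"}, "b": {"a"}, "c": {"d"}, "d": set()}
core = remove_sinks(web)
```

In this example `d` is removed first, which turns `c` into a sink, so only the cycle between `a` and `b` survives. Scores for removed pages would have to be recovered in a later pass, which this sketch does not attempt.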

9.
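邹兆年, 高宏, 李建中, 张硕. 《软件学报》 (Journal of Software), 2010, 21(5): 1007-1019
This paper explores the mining of evolving graphs (graphs that change over time), focusing on mining the set of evolution patterns of connection subgraphs in an evolving graph. A similarity function for connection subgraphs and a fast algorithm for computing it are proposed. Based on this similarity function, a polynomial-time dynamic programming algorithm for discovering the set of evolution patterns is proposed. Experiments on synthetic datasets show that the algorithm has a low error rate and high efficiency. Experiments on real datasets show that the mining results are meaningful in real applications.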

10.
The main objective of time series analysis is to discover the underlying structure of a time series and thus to be able to forecast its future values. This makes it possible to predict, control or simulate variables. Most time series modelling procedures try to forecast future values from lagged ones, so selecting the relevant lagged values is a key step. In this paper, a new consensus method for the selection of relevant lagged values of a time series is introduced: feature ranking aggregated selection (FRASel). The main contribution of this feature selection method is the definition of a consensus decision-making mechanism based on aggregation and expressed as a simple rule. In FRASel, the selected subset of lagged values is decided by applying an aggregation criterion to the results of different flavours of feature ranking methods, taken from different approaches. A thorough empirical analysis is carried out to assess the performance of FRASel. The statistical significance of the experimental results is also analysed with non-parametric statistical tests.
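
To make the idea of aggregating several feature rankings concrete, here is a minimal Borda-style sketch. It is not the FRASel rule, which the abstract does not specify: the function name `aggregate_rankings`, the sum-of-positions criterion, and the fixed `keep` cutoff are all assumptions.

```python
def aggregate_rankings(rankings, keep):
    """Select lags by consensus over several rankings.

    rankings: list of rankings, each a list of the same lag ids, best first.
    keep:     number of lags to select.
    Returns the selected lags in ascending order.
    """
    lags = set(rankings[0])
    if any(set(ranking) != lags for ranking in rankings):
        raise ValueError("every ranking must order the same set of lags")

    # Sum the position each lag receives; a lower total means stronger consensus.
    total_position = {lag: 0 for lag in lags}
    for ranking in rankings:
        for position, lag in enumerate(ranking):
            total_position[lag] += position

    # Ties are broken in favour of the smaller (more recent) lag.
    ordered = sorted(lags, key=lambda lag: (total_position[lag], lag))
    return sorted(ordered[:keep])


# Three hypothetical rankers ordering lags 1..4 of a series.
rankings = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]
selected = aggregate_rankings(rankings, keep=2)
```

Here lags 1 and 2 are selected because they have the lowest summed positions (1 and 3) across the three rankers.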

11.
In this paper, we consider the problem of clustering and re-ranking web image search results so as to improve diversity at high ranks. We propose a novel ranking framework, namely cluster-constrained conditional Markov random walk (CCCMRW), which has two key steps: first, cluster images into topics, and then perform a Markov random walk in an image graph conditioned on constraints derived from the image cluster information. To cluster the retrieval results of web images, a novel graph clustering model is proposed. We explore the surrounding text to mine the correlations between words and images and use these correlations to improve the clustering results. Two kinds of correlations are mainly considered: word-to-image and word-to-word. As a standard text-processing technique, the tf-idf method cannot measure the word-to-image correlation directly. Therefore, we propose to combine the tf-idf method with a novel word feature, namely visibility, to infer the word-to-image correlation. Using the latent Dirichlet allocation model, we define a topic relevance function to compute the weights of word-to-word correlations. Taking word-to-image correlations as heterogeneous links and word-to-word correlations as homogeneous links, graph clustering algorithms such as complex graph clustering and spectral co-clustering are respectively used to cluster images into topics. To perform CCCMRW, a two-layer image graph is constructed, with image cluster nodes added as an upper layer on top of a base image graph. Conditioned on the image cluster information from the upper layer, the Markov random walk is constrained to favour walking across different image clusters, so as to give high rank scores to images of different topics and thereby gain diversity. Encouraging clustering and re-ranking results on Google image search results are reported.

12.
An edge ranking of a graph G is a labeling r of its edges with positive integers such that every path between two different edges e_u, e_v with the same rank r(e_u) = r(e_v) contains an intermediate edge e_w with rank r(e_w) > r(e_u). An edge ranking of G is minimum if the largest rank k assigned is the smallest among all rankings of G. The edge ranking problem is to find a minimum edge ranking of a given graph G. This problem is NP-hard, and no polynomial time algorithm for solving it is known for non-trivial classes of graphs other than the class of trees. In this paper, we first show, on a general graph G, a relation between a minimum edge ranking of G and its minimal cuts, which ensures that we can obtain a polynomial time algorithm for finding a minimum edge ranking of a given graph G if the minimal cuts of each subgraph of G can be found in polynomial time and their number is polynomial. Based on this relation, we develop a polynomial time algorithm for finding a minimum edge ranking on a 2-connected outerplanar graph.
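
The definition above can be checked mechanically. The sketch below only verifies that a given labeling is a valid edge ranking; it does not find a minimum one and is not the paper's algorithm. It relies on an equivalent formulation, assumed here for simple graphs: for every rank k, each connected component of the subgraph formed by edges of rank at most k contains at most one edge of rank exactly k. The function name `is_edge_ranking` is hypothetical.

```python
from collections import Counter, defaultdict


def is_edge_ranking(edges, ranks):
    """Check whether `ranks` is a valid edge ranking of a simple graph.

    edges: list of (u, v) pairs.
    ranks: list of positive integers, ranks[i] being the rank of edges[i].
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_rank = defaultdict(list)
    for edge, rank in zip(edges, ranks):
        by_rank[rank].append(edge)

    # Add edges in increasing rank order; after adding rank k, no component
    # of the rank <= k subgraph may hold two edges of rank k.
    for rank in sorted(by_rank):
        for u, v in by_rank[rank]:
            parent[find(u)] = find(v)
        per_component = Counter(find(u) for u, _ in by_rank[rank])
        if any(count > 1 for count in per_component.values()):
            return False
    return True


path = [("a", "b"), ("b", "c"), ("c", "d")]
valid = is_edge_ranking(path, [1, 2, 1])
invalid = is_edge_ranking(path, [1, 1, 2])
```

On the three-edge path, ranks 1, 2, 1 are valid because the middle edge separates the two rank-1 edges, while ranks 1, 1, 2 are not because the two rank-1 edges are adjacent.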

13.
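Ranking nodes by importance is one of the key aspects of studying the properties of complex networks and is widely applied in data mining, Web search, social network analysis, and many other research fields. Based on the field-theory model from physics, an improved node-importance ranking algorithm of the random-walk type is proposed: the Markov transition matrix of the random-walk model is determined by the field forces through which nodes interact, which yields a more accurate and realistic assessment of node importance. Experimental results show that the adopted evaluation method explains the meaning of node importance more reasonably and gives more realistic and precise evaluation results.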

14.
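邹兆年, 高宏, 李建中, 张硕. 《软件学报》 (Journal of Software), 2010, 21(4): 1007-1019
This paper explores the mining of evolving graphs (graphs that change over time), focusing on mining the set of evolution patterns of connection subgraphs in an evolving graph. A similarity function for connection subgraphs and a fast algorithm for computing it are proposed. Based on this similarity function, a polynomial-time dynamic programming algorithm for discovering the set of evolution patterns is proposed. Experiments on synthetic datasets show that the algorithm has a low error rate and high efficiency. Experiments on real datasets show that the mining results are meaningful in real applications.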

15.
Search engine result pages (SERPs) for a specific query are constructed according to several mechanisms. One of them consists in ranking Web pages by their importance, regardless of their semantics. Indeed, relevance to a query is not enough to provide high quality results, and popularity is used to arbitrate between equally relevant Web pages. The best-known algorithm that ranks Web pages according to their popularity is PageRank. The term Webspam was coined to denote Web pages created with the sole purpose of fooling ranking algorithms such as PageRank. Indeed, the goal of Webspam is to promote a target page by increasing its rank. It is important for Web search engines to spot and discard Webspam in order to provide their users with an unbiased list of results. Webspam techniques evolve constantly to remain effective, but most of the time they still consist in creating a specific linking architecture around the target page to increase its rank. In this paper we propose to study the effects of node aggregation on Google's well-known ranking algorithm (PageRank) in the presence of Webspam. Our node aggregation methods aim to construct clusters of nodes that are treated as a single node in the PageRank computation. Since the Web graph is far too big for classic clustering techniques, we present four lightweight aggregation techniques suitable for its size. Experimental results on the WEBSPAM-UK2007 dataset show the interest of the approach, which is moreover confirmed by statistical evidence.
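
The general idea of treating a cluster as a single node before computing PageRank can be sketched as follows. This does not reproduce any of the paper's four aggregation techniques, which the abstract does not describe: the cluster assignment is supplied by hand, the names `aggregate` and `pagerank` are hypothetical, and dangling nodes are handled by spreading their score uniformly, which is one common convention.

```python
def aggregate(links, cluster_of):
    """Collapse nodes into clusters; links inside a cluster are dropped.

    links:      dict mapping each page to the set of pages it links to.
    cluster_of: dict mapping a page to its cluster id; unmapped pages
                stay as their own node.
    """
    merged = {}
    for page, targets in links.items():
        source = cluster_of.get(page, page)
        merged.setdefault(source, set())
        for target in targets:
            dest = cluster_of.get(target, target)
            merged.setdefault(dest, set())
            if dest != source:
                merged[source].add(dest)
    return merged


def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank; every link target must also be a key."""
    nodes = list(links)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without out-links is spread over all nodes.
        dangling = sum(score[node] for node in nodes if not links[node])
        updated = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in links[node]:
                updated[target] += damping * score[node] / len(links[node])
        score = updated
    return score


# A toy link farm: s1 and s2 exist only to boost the target page t.
web = {"t": {"s1", "s2"}, "s1": {"t"}, "s2": {"t"}, "h": {"t", "g"}, "g": {"h"}}
farm = {"t": "farm", "s1": "farm", "s2": "farm"}
collapsed = aggregate(web, farm)
scores = pagerank(collapsed)
```

After aggregation the links inside the farm disappear, so the farm can no longer recycle score among its own pages; only the link from `h` contributes to it.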

16.
Topic-based ranking in Folksonomy via probabilistic model   (total citations: 1; self-citations: 0; citations by others: 1)
Social tagging is an increasingly popular way to describe and classify documents on the web. However, the quality of the tags varies considerably since the tags are authored freely. How to rate the tags becomes an important issue. Most social tagging systems order tags simply by input sequence, which carries little information about importance and relevance. This limits applications of tags such as information search and tag recommendation. In this paper, we focus on finding the authority score of tags in the whole tag space conditional on topics and put forward a topic-sensitive tag ranking (TSTR) approach to rank tags automatically according to their topic relevance. We first extract topics from the folksonomy using a probabilistic model, and then construct a transition probability graph. Finally, we perform a random walk over the topic level on the graph to get topic rank scores of tags. Experimental results show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking to tag recommendation, which demonstrates that the proposed approach boosts the performance of social-tagging related applications.

17.
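Network operations collect large volumes of system log data, and pinpointing system faults accurately has become an important research direction. A conditional causal mining algorithm (CCMA) is proposed. It generates a set of time series from the log messages, removes the many irrelevant periodic time series using Fourier analysis and linear regression analysis respectively, and then applies a causal inference algorithm to output a directed acyclic graph; by examining the edge distribution of the acyclic graph and eliminating redundant relations, it obtains the final result. Simulation results show that, compared with the dependency mining algorithm (DMA) and the network information correlation and exploration algorithm (NICE), CCMA improves on both main performance indicators, processing time and edge correlation rate, which indicates that CCMA can effectively improve processing speed and mining precision in log event mining.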

18.
Both the quality and quantity of training data have a significant impact on the accuracy of ranking functions in web search. To meet global search needs, a commercial search engine is required to extend its well-tailored service to small countries as well. Due to the intrinsic heterogeneity of query intents and search results across domains (i.e., across languages and regions), it is difficult for a generic ranking function to satisfy all types of queries. Instead, each domain should use a specific, well-tailored ranking function. In order to train a ranking function for each domain with a scalable strategy, it is critical to leverage existing training data to enhance the ranking functions of domains without sufficient training data. In this paper, we present a boosting framework for learning to rank in the multi-task learning context to attack this problem. In particular, we propose to learn non-parametric common structures adaptively from multiple tasks in a stage-wise way. An algorithm is developed to iteratively discover super-features that are effective for all the tasks. The regression function for each task is then estimated as a linear combination of those super-features. We evaluate the accuracy of multi-task learning methods for web search ranking using data from multiple domains of a commercial search engine. Our results demonstrate that multi-task learning methods bring significant relevance improvements over the existing baseline method.

19.
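C-Rank: a credibility evaluation method for Deep Web data records   (total citations: 1; self-citations: 0; citations by others: 1)
To address the credibility of Web information, C-Rank, an effective method for computing the credibility of Deep Web data records, is proposed. The method constructs an S-R credibility network for each record, containing two types of vertices and three types of edges. First, following the idea of credibility propagation, a local credibility value is computed for each vertex from its out-degree; next, a weight is computed for each Record vertex from its in-degree and the credibility values of the adjacent Site vertices; the global credibility value of the whole S-R network is then obtained. Experiments show that C-Rank evaluates the credibility of data records reasonably and effectively, thereby identifying false information and recommending credible data records to users. The method is generally applicable across Deep Web domains.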

20.
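Research on a layering-based algorithm for drawing ventilation network graphs   (total citations: 1; self-citations: 0; citations by others: 1)
Drawing a ventilation network graph with the longest-path method requires frequent searches for the longest path between arbitrary pairs of nodes; using depth-first search wastes a great deal of time exploring useless paths, and judging branch crossings by geometric intersection is inefficient and cannot effectively reduce the number of crossings. This paper proposes introducing the layering method into ventilation network graph drawing. Nodes are assigned to layers with the longest-path method, and an integer programming problem is solved to optimize the layering and reduce long edges; a simulated-annealing genetic algorithm optimizes the node ordering, reducing branch crossings topologically. To cut down on pointless longest-path searches, node coordinates and branch shapes are computed with a longest-path parallel-path method (最长路径并联通路法). A test example of layering-based ventilation network graph drawing is given.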
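
The first step named in this abstract, assigning nodes to layers by the longest-path method, can be sketched as below. This covers only the layering step on a directed acyclic graph and is not the authors' implementation: the integer programming refinement, the simulated-annealing genetic ordering and the coordinate computation are not shown, and the function name `longest_path_layers` is hypothetical.

```python
from collections import deque


def longest_path_layers(edges):
    """Assign each node of a DAG to a layer.

    edges: list of (u, v) pairs meaning air flows from u to v.
    A node's layer is the length of the longest path reaching it from any
    source, so every edge points from a lower layer to a higher one.
    """
    successors, indegree = {}, {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)

    layer = {node: 0 for node in successors}
    ready = deque(node for node in successors if indegree[node] == 0)
    # Process nodes in topological order so each node is visited once,
    # instead of searching longest paths pair by pair.
    while ready:
        u = ready.popleft()
        for v in successors[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return layer


branches = [("s", "a"), ("s", "b"), ("a", "b"), ("b", "t"), ("s", "t")]
layers = longest_path_layers(branches)
long_edges = [(u, v) for u, v in branches if layers[v] - layers[u] > 1]
```

In this example the branch from `s` to `t` spans three layers; edges of this kind are the long edges that the abstract's integer programming step aims to reduce.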
