Similar Documents
1.
This article introduces two of the most widely used link-analysis algorithms, PageRank and HITS. PageRank is built on the intuition of a user randomly browsing forward through web pages, while HITS models the mutually reinforcing relationship between authority pages and hub pages. The basic idea of PageRank is threefold: if a page is referenced by many other pages, it is likely to be important; if a page is referenced only rarely but by an important page, it is likely to be important as well; and a page's importance is divided evenly among, and propagated to, the pages it references. HITS, by contrast, focuses on improving results for broad-topic queries, using an iterative computation to find the most valuable pages for a given query, i.e., the top-ranked authorities.
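A minimal Python sketch of the PageRank recursion summarized above: each page's score is split evenly among the pages it links to, with a damping term for the random surfer. The toy graph, damping factor, and iteration count are illustrative assumptions, not taken from the article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:                                 # importance split among outlinks
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_web))  # "c", cited by three pages, scores highest
```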

2.
Current models of web navigation focus only on the influence of textual information and ignore the role of graphical information. We studied the differential roles of text and graphics in identifying web page widgets of two kinds: textual and graphical. Four versions of each web page were created by systematically removing textual and graphical information. The participants' task was to locate either textual or graphical widgets on the displayed page. Results show that for either kind of widget, task-completion time and number of clicks were significantly lower on pages with graphics than on pages without, demonstrating the importance of graphical information. Textual information is also important: performance in locating graphical widgets under no-graphics conditions was better when text was present than when it was absent. Since text and graphics interact and complement each other in identifying graphical widgets, we conclude that cognitive models of web navigation should include the role of graphical information alongside textual information.

3.
With the development of Web technology and the ever-growing volume of information on the Web, returning high-quality, relevant results has become a major challenge for Web search engines. PageRank and HITS are the two most important link-based ranking algorithms and are used in commercial search engines. In PageRank, however, each page's PR value is distributed evenly among the pages it links to, completely ignoring quality differences between pages; such an algorithm is easily attacked by today's Web spam. Based on this observation, this paper proposes an improvement to PageRank called Page Quality Based PageRank (QPR). QPR dynamically estimates the quality of each page and distributes each page's PR value fairly, in proportion to page quality. Comprehensive experiments on several datasets with different characteristics show that QPR substantially improves the ranking of query results and effectively mitigates the influence of spam pages.
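A hedged sketch of the quality-weighted distribution idea behind QPR: instead of splitting a page's PR value evenly among its outlinks, it is split in proportion to a quality score of each target page. The quality scores here are assumed to be supplied externally; the paper's actual dynamic quality estimator is not reproduced.

```python
def quality_weighted_pagerank(links, quality, damping=0.85, iterations=50):
    """quality: assumed externally computed score per page (hypothetical)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            total_q = sum(quality[t] for t in outlinks)
            if total_q == 0:
                continue                      # dangling mass dropped for brevity
            for target in outlinks:
                # share proportional to the target page's quality
                new_rank[target] += damping * rank[page] * quality[target] / total_q
        rank = new_rank
    return rank
```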

4.
Given a user keyword query, current Web search engines return a list of individual Web pages ranked by their "goodness" with respect to the query. Thus, the basic unit for search and retrieval is an individual page, even though information on a topic is often spread across multiple pages. This degrades the quality of search results, especially for long or uncorrelated (multitopic) queries (in which individual keywords rarely occur together in the same document), where a single page is unlikely to satisfy the user's information need. We propose a technique that, given a keyword query, on the fly generates new pages, called composed pages, which contain all query keywords. The composed pages are generated by extracting and stitching together relevant pieces from hyperlinked Web pages and retaining links to the original Web pages. To rank the composed pages, we consider both the hyperlink structure of the original pages and the associations between the keywords within each page. Furthermore, we present and experimentally evaluate heuristic algorithms to efficiently generate the top composed pages. The quality of our method is compared to current approaches by using user surveys. Finally, we also show how our techniques can be used to perform query-specific summarization of Web pages.
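A simplified sketch of the composition step: fragments extracted from hyperlinked pages are greedily stitched together until every query keyword is covered, keeping links back to the source pages. The ranking by hyperlink structure and keyword association described in the abstract is not shown, and the greedy coverage heuristic is an assumption.

```python
def compose_page(fragments, query_keywords):
    """fragments: list of (source_url, text) pairs extracted from Web pages."""
    remaining = {kw.lower() for kw in query_keywords}
    composed = []
    while remaining:
        # pick the fragment covering the most still-missing keywords
        best = max(fragments,
                   key=lambda f: len(remaining & set(f[1].lower().split())),
                   default=None)
        if best is None or not remaining & set(best[1].lower().split()):
            break                       # no fragment covers any remaining keyword
        composed.append(best)           # keep the source URL, as the paper does
        remaining -= set(best[1].lower().split())
    return composed
```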

5.
Ranking web pages so as to present the most relevant pages for a user's query is one of the main issues in any search engine. In this paper, two new ranking algorithms based on Reinforcement Learning (RL) are offered. RL is a powerful technique of modern artificial intelligence that tunes an agent's parameters interactively. First, by formulating ranking as an RL problem, a new connectivity-based ranking algorithm called RL_Rank is proposed. In RL_Rank, the agent is a surfer who travels between web pages by clicking randomly on a link in the current page. Each web page is a state, and the value function of a state determines the score of that state (page). The reward corresponds to the number of outlinks from the current page. Rank scores in RL_Rank are computed recursively, and the convergence of these scores is proved. Next, we introduce a new hybrid approach combining BM25, a content-based algorithm, with RL_Rank. Both proposed algorithms are evaluated on well-known benchmark datasets and analyzed according to the relevant criteria. Experimental results show that using RL concepts leads to significant improvements in ranking algorithms.
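A sketch of the RL_Rank formulation as the abstract describes it: each page is a state, the surfer clicks a uniformly random link, the reward is the number of outlinks of the current page, and the converged value of a state is its rank score. The discount factor and iteration count are illustrative assumptions.

```python
def rl_rank(links, gamma=0.9, iterations=100):
    """links: dict mapping each page to the list of pages it links to."""
    value = {p: 0.0 for p in links}
    for _ in range(iterations):
        new_value = {}
        for page, outlinks in links.items():
            reward = len(outlinks)            # reward = number of outlinks
            if outlinks:
                # expected value of the next state under a random click
                expected = sum(value[t] for t in outlinks) / len(outlinks)
            else:
                expected = 0.0
            new_value[page] = reward + gamma * expected
        value = new_value
    return value                              # higher value = higher rank score
```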

6.
An ordinary Web page can be divided into informative blocks and noise blocks. The first step in Web-based information retrieval is to filter the noise blocks out of the page. Observation of Web pages shows that pages at the same level of a site mostly share a similar display style and similar noise blocks. Building on the VIPS algorithm, this paper proposes a matching algorithm based on the similarity of same-level pages, which can be used to filter out the noise blocks of a page. Experiments show that the algorithm achieves an accuracy above 95%.
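A hedged sketch of the same-level similarity idea: blocks that recur nearly unchanged across pages generated from the same template are treated as noise, and the rest are kept as informative. A "block" here is just a text string; the actual algorithm matches VIPS-segmented blocks, and the similarity threshold is an assumption.

```python
from difflib import SequenceMatcher

def filter_noise_blocks(pages, threshold=0.9):
    """pages: list of same-level pages, each a list of block strings."""
    informative = []
    for i, page in enumerate(pages):
        others = [b for j, p in enumerate(pages) if j != i for b in p]
        kept = [block for block in page
                if not any(SequenceMatcher(None, block, other).ratio() >= threshold
                           for other in others)]       # recurring block = noise
        informative.append(kept)
    return informative
```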

7.
This paper addresses new and significant research issues in web page design in relation to the use of graphics. The original findings include that (a) graphics play an important role in enhancing the appearance of web pages and thus users' feelings (aesthetics) about them, and that (b) the effective use of graphics is crucial in designing web pages. In addition, we have developed a web page design support database based on a user-centered experimental procedure and a neural network model. This design support database can be used to examine how a specific combination of design elements, particularly the ratio of graphics to text, will affect users' feelings about a web page. As a general rule, a graphics-to-text ratio between 3:1 and 1:1 gives users the strongest feelings of ease of use and clarity. A web page with a ratio of 1:1 has the most realistic look, while a ratio over 3:1 has the fanciest appearance. The results provide useful insights into using graphics on web pages that help web designers best meet users' specific expectations and aesthetic consistency.

8.
Most existing algorithms for clustering search engine results cluster the snippets generated for a user's query; because snippets are short and of uneven quality, clustering quality is hard to improve much (e.g., the suffix tree algorithm and the Lingo algorithm). Traditional full-text clustering algorithms, on the other hand, are computationally expensive and struggle to produce good cluster labels, so they cannot meet the demands of online clustering (e.g., the KMeans algorithm). This paper proposes MFIC (Maximal Frequent Itemset Clustering), an online Web page clustering algorithm based on maximal frequent itemsets over full text. The algorithm first mines maximal frequent itemsets from the full text, then clusters pages according to the maximal frequent itemsets the pages share, and finally generates cluster labels from the frequent items each cluster contains. Experimental results show that MFIC reduces the time of full-text-based page clustering, improves clustering precision by about 15%, and generates readable cluster labels.
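A much-simplified sketch of the MFIC pipeline: mine frequent term sets from full text, then group the pages that share them, with the shared terms doubling as the cluster label. For brevity, mining is restricted to term pairs rather than true maximal frequent itemsets, so this only illustrates the shape of the method.

```python
from itertools import combinations

def mfic_clusters(docs, min_support=2):
    """docs: list of term sets, one per page (full text, tokenized)."""
    counts = {}
    for terms in docs:
        for pair in combinations(sorted(terms), 2):
            counts[pair] = counts.get(pair, 0) + 1
    frequent = [pair for pair, c in counts.items() if c >= min_support]
    clusters = {}
    for itemset in frequent:
        members = tuple(i for i, terms in enumerate(docs)
                        if set(itemset) <= terms)
        clusters[" ".join(itemset)] = members   # frequent terms as the label
    return clusters
```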

9.
A home page is the gateway to an organization's Web site. To design effective Web home pages, it is necessary to understand the fundamental drivers of users' perceptions of Web pages. Not only do designers have to understand potential users' frame of mind, they also have at their disposal a stupefying array of attributes, including numerous font types, audio, video, and graphics, all of which can be arranged on a page in different ways, compounding the complexity of the design task. A theoretical model capable of explaining user reactions at a molar level should be invaluable to Web designers as a complement to prevalent intuitive and heuristic approaches. Such a model transcends piecemeal page attributes to focus on users' overall Web page perceptions. Reasoning that people perceive the cyberspace of Web pages much as they perceive physical places, we use Kaplan and Kaplan's informational model of place perception from environmental psychology to predict that just two dimensions, the understanding of information on a Web page and the involvement potential of a Web page, adequately capture Web page perception at a molar level. We empirically verify the existence of these dimensions and develop valid scales for measuring them. Using a home page as a stimulus in a lab experiment, we find that understanding and involvement together account for a significant amount of the variance in attitude toward the Web page and in intention to browse the underlying Web site. We show that the informational model is a parsimonious and powerful theoretical framework for measuring users' perceptions of Web home pages, and that it could potentially serve as a guide to Web page design and testing efforts.

10.
When optimizing a Web site, it is often necessary, in order to keep costs down, to improve performance without changing the hardware or network configuration. In that situation, modifying the pages that make up the site becomes the main route to better performance. Many mature methods exist for measuring page access speed, but how to derive a sound optimization strategy from the measurement results has rarely been discussed. This paper applies the FCM algorithm to cluster the measurement results together with the site logs, and thereby obtains a good optimization strategy.
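A compact fuzzy C-means (FCM) sketch in numpy, the kind of clustering the paper applies to speed measurements and site logs. The feature matrix `data` is assumed to have been extracted from the measurements beforehand; the fuzzifier m and iteration count are conventional defaults.

```python
import numpy as np

def fcm(data, n_clusters, m=2.0, iterations=100, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), n_clusters))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(iterations):
        um = u ** m
        centers = (um.T @ data) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)              # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True) # standard FCM membership update
    return centers, u
```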

11.
The Internet contains large numbers of duplicate Web pages, and in information retrieval and large-scale crawling, duplicate-page elimination is one of the keys to efficiency. Building on deduplication algorithms based on "fingerprints" or signatures, this paper proposes a page deduplication algorithm based on edit distance, which obtains the similarity between pages by computing the edit distance between their fingerprint sequences. It overcomes the drawback of fingerprint- and signature-style algorithms, which ignore the structure of the page body: by comparing pages on both content and body structure, duplicate judgments become more accurate. Experiments show the algorithm is effective, with high deduplication precision and recall.
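A minimal sketch of the described approach: each page is reduced to a sequence of block fingerprints (here simply hashes of its text blocks, which preserves body structure), and similarity is derived from the edit distance between two fingerprint sequences. The similarity normalization is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance with a rolling row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (x != y))  # substitute / match
    return dp[-1]

def page_similarity(blocks_a, blocks_b):
    fp_a = [hash(b) for b in blocks_a]   # one fingerprint per text block
    fp_b = [hash(b) for b in blocks_b]
    d = edit_distance(fp_a, fp_b)
    return 1.0 - d / max(len(fp_a), len(fp_b), 1)
```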

12.
Flash memory is a purely electronic device with the advantages of small size, fast read speed, low energy consumption, and strong shock resistance, and it is used to partially replace magnetic disks in order to improve storage system performance. Existing buffer replacement algorithms, however, are designed and optimized for the physical characteristics of magnetic disks, so buffer replacement needs to be redesigned around the physical characteristics of flash. This paper proposes CF-ARC, a new buffer replacement algorithm for flash-based databases. CF-ARC introduces a new page replacement mechanism: when deciding whether to evict a clean page or a dirty page, it takes access frequency into account and preferentially evicts clean pages with low access frequency, so that hot pages stay in the buffer and the hit ratio rises, yielding better performance. Comparative analysis of the experimental results shows that CF-ARC outperforms other replacement algorithms in most cases.
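A hedged sketch of the clean-first eviction preference attributed to CF-ARC: on eviction, prefer a clean page with low access frequency (cheap to drop on flash, since no write-back is needed), falling back to the coldest dirty page. ARC's adaptive list machinery is omitted; only the victim choice is illustrated, and the data structure is a hypothetical stand-in.

```python
def choose_victim(buffer):
    """buffer: dict mapping page id -> {'dirty': bool, 'freq': int}."""
    clean = [(meta['freq'], page) for page, meta in buffer.items()
             if not meta['dirty']]
    if clean:
        return min(clean)[1]     # coldest clean page: no flash write-back cost
    dirty = [(meta['freq'], page) for page, meta in buffer.items()]
    return min(dirty)[1]         # otherwise the coldest dirty page
```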

13.
Word semantic similarity computation based on Wikipedia community mining
Word semantic similarity computation is widely used in natural language processing tasks such as word sense disambiguation, semantic information retrieval, and automatic text classification. Unlike traditional methods, this paper proposes a word semantic similarity measure based on community mining in Wikipedia. The method does not consider the text content of word pages; instead, it exploits Wikipedia's vast network of category-labeled word pages, applying the topic-based community discovery algorithm HITS to this page network to obtain the communities of word pages. Given the communities, the semantic similarity of two words is computed from three aspects: (1) the semantic relation between the word pages; (2) the semantic relation between the word-page communities; and (3) the semantic relation between the categories to which the communities belong. Experiments on the standard WordSimilarity-353 dataset show that the algorithm is feasible and slightly outperforms several current classical algorithms, reaching a Spearman correlation coefficient of 0.58 in the best case.
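A sketch of the HITS iteration the paper applies to the Wikipedia word-page network: authority and hub scores reinforce each other over the link structure. The community- and category-level similarity steps from the paper are not shown.

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority: sum of hub scores of the pages linking to it
        new_auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for t in outlinks:
                new_auth[t] += hub[page]
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}
        # hub: sum of authority scores of the pages it links to
        new_hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return auth, hub
```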

14.
Effectively finding relevant Web pages from linkage information
This paper presents two hyperlink analysis-based algorithms to find relevant pages for a given Web page (URL). The first algorithm comes from the extended cocitation analysis of the Web pages. It is intuitive and easy to implement. The second one takes advantage of linear algebra theories to reveal deeper relationships among the Web pages and to identify relevant pages more precisely and effectively. The experimental results show the feasibility and effectiveness of the algorithms. These algorithms could be used for various Web applications, such as enhancing Web search. The ideas and techniques in this work would be helpful to other Web-related research.
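A minimal sketch of the extended cocitation intuition behind the first algorithm: two pages are related when many pages link to both, so relevance candidates for a given URL can be ranked by cocitation count. The second, linear-algebra-based algorithm is not reproduced here.

```python
from collections import Counter

def cocited_pages(links, url, top_k=10):
    """links: dict mapping each page to the list of pages it links to."""
    citing = [p for p, outs in links.items() if url in outs]
    counter = Counter()
    for p in citing:
        for other in links[p]:
            if other != url:
                counter[other] += 1    # co-cited with the target once more
    return counter.most_common(top_k)
```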

15.
Donald B. Innes. Software, 1977, 7(2): 271-273
Many implementations of paged virtual memory systems employ demand fetching with least recently used (LRU) replacement. The stack characteristic of LRU replacement implies that a reference string which repeatedly accesses a number of pages in sequence will cause a page fault for each successive page referenced when the number of pages is greater than the number of page frames allocated to the program's LRU stack. In certain circumstances, when the individual operations being performed on the reference string are independent, or more precisely are commutative, the order of alternate page reference sequences can be reversed. This paper considers sequences which cannot be reversed and shows how placement of information on pages can achieve a similar effect if at least half the pages can be held in the LRU stack.
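A small simulation of the phenomenon described: cycling forward over k+1 pages with only k frames makes LRU fault on every reference, while reversing alternate passes (legal when the operations commute) makes almost every reference hit. The frame and page counts are illustrative.

```python
from collections import OrderedDict

def count_faults(refs, frames):
    lru, faults = OrderedDict(), 0
    for page in refs:
        if page in lru:
            lru.move_to_end(page)        # page becomes most recently used
        else:
            faults += 1
            if len(lru) == frames:
                lru.popitem(last=False)  # evict the least recently used page
            lru[page] = True
    return faults

pages = list(range(5))                       # 5 pages, 4 frames
forward = pages * 5                          # repeated forward passes
zigzag = pages + (pages[::-1] + pages) * 2   # alternate passes reversed
print(count_faults(forward, 4))              # 25: every reference faults
print(count_faults(zigzag, 4))               # 9: one fault per pass after the first
```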

16.
Automatic identification of informative sections of Web pages
Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content and largely do not seek the noninformative content. A tool that assists an end-user or application in searching and processing information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment a Web page into Web page blocks; second, it must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, 2) looking for blocks with desired features, and 3) using classifiers trained with block features, respectively. Operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
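A hedged sketch of the recurrence heuristic ContentExtractor is described as using: blocks whose content recurs across many pages of a site are treated as noninformative, while rarely repeated blocks are kept as primary content. The recurrence cutoff is an assumption.

```python
from collections import Counter

def primary_blocks(pages, max_fraction=0.5):
    """pages: list of pages, each a list of block strings."""
    freq = Counter(block for page in pages for block in set(page))
    cutoff = max_fraction * len(pages)
    # keep only blocks that do not recur across many pages of the site
    return [[b for b in page if freq[b] <= cutoff] for page in pages]
```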

17.
Web information extraction requires clustering the pages of a target site in order to detect and generate the templates needed for extraction. Traditional page clustering based on DOM-tree edit distance is ill-suited to dynamically templated pages with complex Document Object Model (DOM) trees. This paper proposes an improved page clustering algorithm based on local tag-tree matching: exploiting the hierarchical difference between template nodes and non-template nodes in the tag tree, it assigns nodes different matching weights according to their influence on layout, and uses local tree matching to compute structural similarity between pages effectively. Experimental results show that, compared with traditional DOM-tree edit-distance clustering, the improved algorithm achieves higher accuracy when clustering dynamically template-generated pages, with lower time complexity.
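A hedged sketch of weighted local tag-tree matching: nodes nearer the root, which shape the layout, carry larger matching weights, and similarity is the matched weight over the total weight of the compared top levels. The tree encoding, weight decay, and depth cutoff are illustrative assumptions, not the paper's actual weighting scheme.

```python
def match_weight(a, b, weight=1.0, decay=0.5, depth=3):
    """a, b: tag trees encoded as (tag, [children]) tuples."""
    if a[0] != b[0]:
        return 0.0, weight                 # tags differ: nothing matched
    matched = total = weight
    if depth > 1:                          # local match: only the top levels
        for ca, cb in zip(a[1], b[1]):     # compare children left to right
            m, t = match_weight(ca, cb, weight * decay, decay, depth - 1)
            matched += m
            total += t
        # children present in only one tree still count toward the total
        total += abs(len(a[1]) - len(b[1])) * weight * decay
    return matched, total

def page_similarity(a, b):
    matched, total = match_weight(a, b)
    return matched / total                 # 1.0 for identical compared levels
```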

18.
Song Xiaodong. Performance Evaluation, 2005, 60(1-4): 5-29
Most computer systems use a global page replacement policy based on the LRU principle, approximately selecting a least recently used page for replacement across the entire user memory space. During execution, a memory page can be marked as LRU even while its program is handling page faults. We define LRU pages arising under such conditions as false LRU pages, because they are not produced by program memory reference delays, which is inconsistent with the LRU principle. False LRU pages can significantly increase page faults and even cause system thrashing. This poses a more serious risk in large parallel systems with distributed memory because of the coordination among processes running on individual nodes: thrashing in a single node or a small number of nodes can severely affect other nodes running coordinating processes, and even crash the whole system. In this paper, we focus on how to improve the page replacement algorithm running on one node.

After a careful study characterizing memory usage and thrashing behavior in multiprogramming systems using LRU replacement, we propose an LRU replacement alternative, called token-ordered LRU, to eliminate or reduce unnecessary page faults by effectively ordering and scheduling memory space allocations. Compared with traditional thrashing protection mechanisms such as load control, our policy allows more processes to keep running, supporting synchronous distributed process computing. We have implemented the token-ordered LRU algorithm in a Linux kernel to show its effectiveness.
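A conceptual sketch of the token idea, under simplifying assumptions: a single token is granted to one process at a time, and while it holds the token its pages are exempt from eviction, so it can resolve its page faults instead of thrashing. The class and its state are simplified stand-ins for kernel structures, not the actual Linux implementation.

```python
from collections import OrderedDict

class TokenOrderedLRU:
    def __init__(self, frames):
        self.frames = frames
        self.lru = OrderedDict()     # (pid, page) -> True, oldest first
        self.token_holder = None     # pid whose pages are protected

    def grant_token(self, pid):
        self.token_holder = pid

    def access(self, pid, page):
        key = (pid, page)
        if key in self.lru:
            self.lru.move_to_end(key)    # hit: becomes most recently used
            return False
        if len(self.lru) == self.frames:
            # evict the oldest page not owned by the token holder
            victim = next((k for k in self.lru if k[0] != self.token_holder),
                          next(iter(self.lru)))
            del self.lru[victim]
        self.lru[key] = True
        return True                      # page fault
```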


19.
A new database access graph algorithm and its application
邓亚丹, 景宁, 熊伟. 计算机工程, 2009, 35(17): 25-27
To address the fact that current databases cannot predict which pages will be accessed next, this paper proposes an application access graph model together with the associated access graph algorithms, and analyzes the algorithms' performance. The access graph algorithm is implemented in the GKD-Base database; based on its predictions, pages that will not be accessed for some time can be swapped out of the buffer in advance, optimizing buffer space. Experimental results show that after the CG algorithm is introduced into the database kernel, SQL execution speed improves to a certain extent thanks to the optimized buffer space.
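A hedged sketch of the access-graph idea: record observed page-to-page transitions, then treat buffered pages not reachable from the current page within a few steps as unlikely to be accessed soon, and therefore as candidates to swap out early. The graph construction and horizon are illustrative assumptions, not GKD-Base's actual CG algorithm.

```python
from collections import defaultdict

class AccessGraph:
    def __init__(self):
        self.edges = defaultdict(set)
        self.last = None

    def record(self, page):
        if self.last is not None:
            self.edges[self.last].add(page)   # observed transition
        self.last = page

    def evictable(self, cached_pages, current, horizon=3):
        reachable, frontier = {current}, {current}
        for _ in range(horizon):              # BFS a few steps ahead
            frontier = {n for p in frontier for n in self.edges[p]} - reachable
            reachable |= frontier
        # pages not reachable soon are candidates to swap out early
        return [p for p in cached_pages if p not in reachable]
```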

20.
Although many web pages consist of blocks of text surrounded by graphics, there is a lack of valid empirical research to aid the design of this type of page [D. Diaper, P. Waelend, Interact. Comput. 13 (2000) 163]. In particular, little is known about the influence of animations on interaction with web pages. Proportion, in particular the Golden Section, is known to be a key determinant of the aesthetic quality of objects, and aesthetics have recently been identified as a powerful factor in the quality of human–computer interaction [N. Tractinsky, A.S. Katz, D. Ikar, Interact. Comput. 13 (2000) 127]. The current study aimed to establish the relative strength of the effects of graphical display and of the screen ratio of content to navigation areas in web pages, using an information retrieval task and a split-plot experimental design. Results demonstrated an effect of screen ratio, but no effect of graphical display on task performance or on two subjective outcome measures. There was, however, an effect of graphical display on perceived distraction, with animated display leading to more distraction than static display, t(64) = 2.33. Results are discussed in terms of processes of perception and attention, and recommendations for web page design are given.

