Similar Documents
20 similar documents found (search time: 62 ms)
1.
A web page's level of activity changes over its life cycle. Some pages are valuable only during a particular period and become obsolete afterwards. Analyzing the life cycle of web pages from the user's perspective can improve the performance of web crawlers and search engines and make online advertising more effective. Using page-visit statistics collected by a proxy server, we studied the life cycle of web pages and derived a model of how user interest evolves. This model helps in better understanding how the Web is organized and operates.

2.
With the growth of the Internet, users expect relevant and optimal results from the Web through search engines. Retrieval performance can often be improved using a variety of algorithms and methods, and the sheer abundance of content on the Web has driven the demand for better search systems. Categorizing web pages helps considerably in addressing this issue: the anatomy of web pages, their links, text categories, and the relations among them become better understood over time. Search engines combine several inputs for a given keyword (or keywords) to obtain quality results in the shortest possible time, and categorization is mostly done by separating content using the web link structure. We estimate two weights for each web page, (a) a Page Retaining Weight (PRW) and (b) a Page Forwarding Weight (PFW), and group pages by these weights for categorization. Based on our experimental results we classify web pages into four groups: (a) simple, (b) axis-shifted, (c) fluctuating, and (d) oscillating. Such a categorization improves search engine performance and informs the study of web modeling.
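A minimal sketch of how such a two-weight categorization could work; the weight formulas, the `eps` threshold, and the sign-change grouping rule are assumptions of ours, as the abstract does not publish the paper's exact definitions:

```python
# Hypothetical sketch: grouping pages by the movement of two weights.
# The weight formulas below are assumptions, not the paper's definitions.

def page_weights(in_links: int, out_links: int, visits: int):
    """Estimate a Page Retaining Weight (PRW) and a Page Forwarding Weight (PFW)."""
    total = (in_links + out_links) or 1
    prw = visits * in_links / total   # tendency of the page to hold traffic
    pfw = visits * out_links / total  # tendency of the page to pass traffic on
    return prw, pfw

def categorize(prw_series, pfw_series, eps=0.1):
    """Assign one of the four groups from the relative movement of PRW and PFW."""
    diffs = [p - f for p, f in zip(prw_series, pfw_series)]
    sign_changes = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    if sign_changes == 0:
        return "simple" if abs(diffs[0]) > eps else "axis-shifted"
    return "fluctuating" if sign_changes == 1 else "oscillating"
```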

3.
Information on the WWW is extremely rich, but search engines suffer from low precision. To extract useful information accurately from retrieved web pages, developing an automatic filter has become a pressing need. This paper proposes a web page filtering method based on natural language processing that analyzes a page's natural language at three levels: syntax, semantics, and context. Experiments show that the method improves search engine precision to a certain extent.

4.
Ever since the dramatic boom of the Internet, web search engines have attracted considerable attention from both researchers and developers. To improve website efficiency, many companies rent company-wide local search engines from search engine providers. Since renting a web service is a promising alternative to purchasing one within the web industry, how to price such search engines effectively has become an important and pressing issue for practitioners and researchers alike. In this paper, we present a pricing model for the local search engines of a company website, based on a discrete-time independent-increment process. We define a stopping time and derive the provider's expected revenue over the rental horizon. Because obtaining an analytical solution for the expected revenue is considerably complex and difficult, the optimal monthly rental is discussed and exemplified through empirical experiments. Maximizing revenue, we investigate two strategies: allowing different initial lock-in periods, and offering a coupon that waives part of the fee for initial use. The experiments illustrate the best rental and sale price scenarios.
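A minimal formalization of the rental objective implied by the abstract; the notation (monthly fee m, discount factor δ, stopping time τ) is ours, not the paper's:

```latex
% m      = monthly rental fee (the decision variable)
% \delta = monthly discount factor
% \tau   = stopping time at which the customer terminates the rental,
%          driven by a discrete-time independent-increment process
\mathbb{E}\bigl[R(m)\bigr] \;=\; \mathbb{E}\!\left[\sum_{t=1}^{\tau(m)} \delta^{\,t}\, m\right],
\qquad
m^{*} \;=\; \arg\max_{m}\; \mathbb{E}\bigl[R(m)\bigr]
```

Per the abstract, this expectation has no convenient closed form, which is why the optimal monthly rental is explored empirically.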

5.
Parallel corpora are an important basic resource in natural language processing, and they exist in large quantities in bilingual parallel web pages. This paper first introduces a method for computing the credibility of bilingual URL matching patterns, then proposes an algorithm for identifying bilingual parallel web pages based on local credibility. Based on the global credibility of matching patterns, two optimizations are proposed: using global credibility to recover matching patterns that the initial algorithm filtered out for falling below the local credibility threshold, and mining deep web pages through global credibility combined with page probing. Going further, by combining website-level bilingual credibility with link relations, more credible bilingual websites around the seed sites are detected. In addition to automatic recognition of bilingual URL matching patterns, a search engine is used to quickly identify bilingual pages from a few high-credibility matching patterns. To improve the precision with which these five methods identify candidate bilingual page pairs, the bilingual similarity of each candidate pair is computed and a threshold is set to filter out non-bilingual pairs. Experiments verify the effectiveness of the proposed methods.
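A toy illustration of the bilingual URL matching idea; the marker list and the credibility estimate are assumptions, since the abstract does not spell out the paper's pattern language:

```python
import re

# Hypothetical sketch: bilingual sites often expose page pairs whose URLs
# differ only in a language marker, e.g. /en/ vs. /cn/. The marker pairs
# and the credibility estimate below are our assumptions.
LANG_PAIRS = [("en", "cn"), ("english", "chinese")]

def candidate_counterparts(url: str):
    """Generate candidate parallel-page URLs by swapping language markers."""
    for a, b in LANG_PAIRS:
        for src, dst in ((a, b), (b, a)):
            swapped = re.sub(rf"(?<=[/._-]){src}(?=[/._-])", dst, url)
            if swapped != url:
                yield swapped

def local_credibility(hits: int, tries: int) -> float:
    """Share of pattern applications that actually located a live counterpart."""
    return hits / tries if tries else 0.0
```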

6.
Applied Soft Computing, 2007, 7(1): 398-410
Personalized search engines are important tools for finding web documents for specific users, because they can locate information on the WWW as accurately as possible using efficient methods of data mining and knowledge discovery. Traditional search engines vary in type and features, supporting different functionality and ranking methods. Newer search engines that exploit link structures have produced improved results, overcoming the limitations of conventional text-based search engines. Going a step further, this paper presents a system that provides users with personalized results derived from a link-based search engine. A fuzzy document retrieval system, constructed from a fuzzy concept network based on the user's profile, personalizes the results yielded by link-based search engines according to the preferences of the specific user. A preliminary experiment with six subjects indicates that the developed system can find not only relevant but also personalized web pages, depending on the user's preferences.
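A small sketch of the kind of personalization this implies; the max-min composition over the concept network and the linear blend with the link-based score are our assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch of profile-based re-ranking over a fuzzy concept network.

def fuzzy_relevance(doc_concepts, profile, network):
    """doc_concepts: concept -> membership degree in the document.
    profile:      concept -> user preference degree.
    network:      (c1, c2) -> relatedness of two concepts in [0, 1]."""
    best = 0.0
    for c_doc, mu_doc in doc_concepts.items():
        for c_user, mu_user in profile.items():
            link = 1.0 if c_doc == c_user else network.get((c_doc, c_user), 0.0)
            best = max(best, min(mu_doc, link, mu_user))  # max-min composition
    return best

def personalized_score(link_score, fuzzy_score, alpha=0.5):
    """Blend the link-based rank with the user-profile relevance."""
    return alpha * link_score + (1 - alpha) * fuzzy_score
```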

7.
With the rapid development of Internet technology and the explosive growth in the number of web pages, search engines have become irreplaceable as the entry point through which people use the Internet. As the information source of a search engine, the web spider is an indispensable component, and this paper introduces the key techniques in web spider design. Moreover, as users' needs grow more personalized and the number of pages increases sharply, general-purpose search engines can no longer satisfy the needs of specific users, and vertical search engines have developed rapidly; research on focused crawlers has likewise made significant breakthroughs and progress. A focused crawler differs from a general crawler: the general crawler emphasizes the completeness of the crawl, while the focused crawler emphasizes the relevance of pages to a specific topic. The paper also reviews and summarizes the state of research on focused crawlers.

8.
Nowadays, searches for the webpages of a person with a given name constitute a notable fraction of queries to web search engines. Such a query normally returns webpages related to several namesakes who happen to share the queried name, leaving on the user the burden of disambiguating them and collecting the pages relevant to the particular person of interest. In this article we develop a Web People Search approach that clusters webpages based on their association with different people. Our method exploits a variety of semantic information extracted from web pages, such as named entities and hyperlinks, to disambiguate among the namesakes referred to on the pages. We demonstrate the effectiveness of our approach by testing the efficacy of the disambiguation algorithms and their impact on person search.
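A bare-bones sketch of namesake clustering in this spirit; the Jaccard measure, the single-pass scheme, and the 0.3 threshold are stand-ins for the paper's richer features and algorithm:

```python
# Hypothetical sketch: pages are represented by the named entities and linked
# hosts they mention, and merged into a cluster when feature overlap is high.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_pages(pages: dict, threshold: float = 0.3):
    """pages: url -> set of features (named entities, hyperlink hosts)."""
    clusters = []  # each cluster: {"urls": [...], "feats": set}
    for url, feats in pages.items():
        home = next((c for c in clusters
                     if jaccard(feats, c["feats"]) >= threshold), None)
        if home is None:
            clusters.append({"urls": [url], "feats": set(feats)})
        else:
            home["urls"].append(url)
            home["feats"] |= feats
    return clusters
```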

9.
The explosive growth of web data places enormous storage and service pressure on search engines, and large amounts of redundant, low-quality, and even spam data waste a great deal of their storage and computing capacity. Under these circumstances, how to build a web page quality assessment framework and algorithms suited to the practical environment of the World Wide Web has become an important research topic in information retrieval. Building on previous work, and with the participation of web users and web page designers, this paper proposes a web page quality assessment framework comprising four dimensions and thirteen factors: authority/reputation, content, timeliness, and visual presentation. Annotation data show that the framework is practical to apply and yields fairly consistent labels. Finally, an Ordinal Logistic Regression model is used to analyze the importance of each dimension, leading to several instructive conclusions, among them that whether a page's content and timeliness satisfy user needs is a decisive factor in its quality.
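For reference, the standard proportional-odds form of ordinal logistic regression, written in our own notation with the four quality dimensions collected in x:

```latex
% Y \in \{1, \dots, J\}  ordered quality grades
% \mathbf{x}             scores on the four quality dimensions
% \theta_j               cut-point between grades j and j+1
P\bigl(Y \le j \mid \mathbf{x}\bigr)
  \;=\; \frac{1}{1 + \exp\!\bigl(-(\theta_j - \boldsymbol{\beta}^{\top}\mathbf{x})\bigr)},
\qquad j = 1, \dots, J-1
```

The fitted coefficients β indicate the relative importance of each dimension, which is how such a model supports conclusions about content and timeliness.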

10.
This paper investigates the composition of search engine results pages. We define which elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to what degree they are used for popular vs. rare queries. To this end, we sent 500 queries of each type to the major search engines Google, Yahoo, Live.com, and Ask, and counted how often the different elements were used by each engine. In total, our study is based on 42,758 elements. We find that search engines take quite different approaches to composing results pages, so the result sets a user sees differ considerably depending on the search engine and query used. Organic results still play the major role on results pages, but various shortcuts are of some importance, too. Regarding the frequency of certain hosts within the result sets, all search engines show Wikipedia results quite often, while the other hosts shown depend on the engine used. Both Google and Yahoo prefer results from their own offerings (such as YouTube or Yahoo Answers). Since we used the .com interfaces of the search engines, the results may not hold for other country-specific interfaces.

11.
12.
This paper proposes a new method for obtaining bilingual web pages from the result pages returned by a search engine, divided into two tasks. The first task is to automatically detect and collect the data records in the search engine's result pages; this step identifies useful record summaries through clustering and provides effective features for the second task, the verification and acquisition of high-quality mixed bilingual pages. The verification of mixed bilingual pages is treated as a classification problem, and the method does not depend on a specific domain or search engine. On 2,516 result records collected from search engines and manually annotated, the proposed method achieves 81.3% precision and 94.93% recall.

13.
The enormous amount of information available on the Internet requires the use of search engines to find specific information. As far as web accessibility is concerned, search engines present two kinds of barriers: on the one hand, the interfaces for making queries and accessing results are not always accessible; on the other hand, web accessibility is not taken into account in information retrieval (IR) processes. Consequently, in addition to interface problems, accessing the items in the list of results tends to be an unsatisfactory experience for people with disabilities. Some groups of users cannot take advantage of the services provided by search engines, as the results are not useful to them due to accessibility restrictions. The goal of this paper is to propose the integration of web accessibility measurement into information retrieval processes. Firstly, quantitative accessibility metrics are defined in order to accurately measure the accessibility level of web pages. Secondly, a model to integrate these metrics within IR processes is proposed. Finally, a prototype search engine that re-ranks results according to their accessibility level, based on the proposed model, is described.
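A minimal sketch of re-ranking by accessibility; the linear blend and the weight alpha are our assumptions, while the paper defines its own metrics and integration model:

```python
# Hypothetical sketch: blend retrieval relevance with an accessibility
# metric in [0, 1] and sort by the combined score.

def rerank(results, alpha: float = 0.7):
    """results: list of (url, relevance, accessibility), all scores in [0, 1]."""
    scored = [(alpha * rel + (1 - alpha) * acc, url)
              for url, rel, acc in results]
    return [url for _, url in sorted(scored, reverse=True)]

hits = [("a.example", 0.9, 0.2), ("b.example", 0.8, 0.9), ("c.example", 0.5, 0.95)]
print(rerank(hits))  # b.example overtakes a.example once accessibility counts
```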

14.
With the Internet growing exponentially, search engines face unprecedented challenges. A focused search engine selectively seeks out web pages relevant to user topics, and determining the best strategy for focused search is a crucial and popular research topic. At present, the rank values of unvisited web pages are computed from hyperlinks (as in the PageRank algorithm), from a vector space model, or from a combination of the two, rather than from the semantic relations between the user topic and the unvisited pages. In this paper, we propose a concept context graph that stores knowledge context based on the user's history of clicked web pages and guides a focused crawler in its next crawl. The concept context graph provides a novel semantic ranking that steers the crawler toward pages highly relevant to the user's topic. By computing concept distance and concept similarity among the concepts of the graph, and by matching unvisited web pages against it, we compute the rank values of unvisited pages and pick out the relevant hyperlinks. We also build the focused crawling system and report the precision, recall, average harvest rate, and F-measure of our approach against Breadth-First, Cosine Similarity, Link Context Graph, and Relevancy Context Graph baselines. The results show that our method outperforms the others.
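A rough sketch of scoring unvisited pages against a concept context graph; the inverse-distance decay and the averaging are our stand-ins for the paper's concept-distance and concept-similarity computations:

```python
from collections import deque

# Hypothetical sketch: rank an unvisited page by how close its concepts sit
# to the topic concepts in the concept context graph (hops = concept distance).

def concept_distance(graph: dict, src: str, dst: str) -> int:
    """BFS hop count between two concepts; graph: concept -> set of neighbours."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return len(graph) + 1  # unreachable: treat as maximal distance

def page_rank_value(page_concepts, topic_concepts, graph) -> float:
    """Average inverse concept distance between page and topic concepts."""
    pairs = [(p, t) for p in page_concepts for t in topic_concepts]
    if not pairs:
        return 0.0
    return sum(1.0 / (1 + concept_distance(graph, p, t)) for p, t in pairs) / len(pairs)
```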

15.
The problem of obtaining relevant results in web searching has been tackled with several approaches. Although very effective techniques are currently used by the most popular search engines when no a priori knowledge of the user's desires beyond the search keywords is available, in other settings it makes sense to design search methods that operate on a thematic database of web pages referring to a common body of knowledge or to specific sets of users. On these premises we designed and developed a search method that deploys data mining and optimization techniques to provide a more significant and restricted set of pages as the final result of a user search. We adopt a vectorization method based on search context and user profile, apply clustering techniques, and then refine the clusters with a specially designed genetic algorithm. In this paper we describe the method, its implementation, and the algorithms applied, and we discuss some experiments that have been run on test sets of web pages.

16.
The existence of massive numbers of web pages and their rapid growth make it difficult for general-purpose search engines to return satisfactory results for topic- or domain-oriented queries. The focused crawler studied in this paper is devoted to collecting topic-relevant information, greatly reducing the number of pages that must be processed. It evaluates each page's topical relevance and preferentially crawls pages with higher relevance. Using a subspace-based semantic analysis technique combined with naive Bayes and support vector machine classifiers, we design and implement an efficient focused crawler. Experiments show that the algorithm achieves both good accuracy and high efficiency.
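A skeleton of the relevance-first crawl loop this describes; the `relevance` callback stands in for the subspace-semantic Bayes/SVM classifier, which is not reproduced here:

```python
import heapq

# Hypothetical sketch of a focused crawler: a priority queue ordered by
# predicted topical relevance, crawling the most promising links first.

def crawl(seed_urls, relevance, fetch_links, limit=1000, threshold=0.5):
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(harvested) < limit:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -neg_score >= threshold:
            harvested.append(url)
            for link in fetch_links(url):       # caller-supplied fetcher/parser
                if link not in visited:
                    heapq.heappush(frontier, (-relevance(link), link))
    return harvested
```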

17.
Link Analysis Based on Categorization (cited by: 1; self-citations: 0; citations by others: 1)
王元珍, 陈涛. 《计算机工程与应用》 (Computer Engineering and Applications), 2005, 41(13): 172-173, 203
In today's mainstream search engines, link analysis is the most commonly used tool for computing the value of a web page, but for the broad query topics that users often enter, link analysis algorithms can hardly produce a result that satisfies every user. This paper attempts to improve link analysis from another angle: on top of traditional link analysis it incorporates ideas from Web clustering algorithms, adapts and combines the two, and proposes a categorization-based link analysis technique whose performance is demonstrated by experimental results.

18.
Whenever machine learning is used to prevent illegal or unsanctioned activity and there is an economic incentive, adversaries will attempt to circumvent the protection provided. Constraints on how adversaries can manipulate the training and test data of classifiers used to detect suspicious behavior make problems in this area tractable and interesting. This special issue highlights papers spanning many disciplines, including email spam detection, computer intrusion detection, and the detection of web pages deliberately designed to manipulate the rankings returned by modern search engines. The four papers in this special issue provide a standard taxonomy of the types of attacks to be expected in an adversarial framework, demonstrate how to design classifiers that are robust to deleted or corrupted features, demonstrate the ability of modern polymorphic engines to rewrite malware so that it evades detection by current intrusion detection and antivirus systems, and provide approaches to detect web pages designed to manipulate the page scores returned by search engines. We hope that these papers and this special issue encourage the multidisciplinary cooperation required to address the many interesting problems in this relatively new area, including predicting the future of the arms races created by adversarial learning, developing effective long-term defensive strategies, and creating algorithms that can process the massive amounts of training and test data available for Internet-scale problems.

19.
The rapid development of the Internet poses great challenges to traditional crawlers and search engines, and search engines targeted at specific domains and user groups have emerged in response. The Web topic-information gathering system (web spider) is the core component of a topical search engine; its task is to return qualifying web pages to the user or store them in an index. Given how vast the Web's information resources are, gathering the content of interest comprehensively and efficiently is the central research problem for web spiders. This paper proposes topic-focused crawling based on web page segmentation. Experimental results show that, compared with other crawling algorithms, the proposed algorithm achieves higher efficiency, crawl precision, crawl recall, and tunnel-traversal ability.

20.
A focused crawler collects pages related to a specific topic and builds the page collection for a search engine. Traditional focused crawlers use the vector space model and local search algorithms, and both their precision and recall are relatively low. This paper analyzes the problems of focused crawlers and their corresponding solutions, and concludes with an outlook on future research directions.
