Similar Documents
20 similar documents found (search time: 31 ms)
1.
The rapid growth of the Internet poses an enormous challenge to traditional crawlers and search engines, and search engines targeting specific domains and user groups have emerged in response. The Web topical information gathering system (the web spider) is the core component of a topical search engine; its task is to return qualifying Web pages to the user or store them in the index repository. Given the breadth of information resources on the Web, gathering content of interest comprehensively and efficiently is the central research problem for web spiders. This paper proposes topical crawling based on page-segmentation (block) analysis. Experimental results show that, compared with other crawling algorithms, the proposed algorithm achieves higher efficiency, crawling precision, crawling recall, and tunneling ability.
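The abstract does not describe the page-segmentation step itself; as a minimal sketch of the idea, the Python below (the function names and the `(block_text, urls)` block representation are hypothetical) scores each segmented block against a topic vocabulary and ranks the block's outgoing links by that score:

```python
import re

def block_relevance(block_text, topic_terms):
    """Fraction of a block's words that are topic terms (a crude relevance proxy)."""
    words = re.findall(r"\w+", block_text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)

def prioritize_links(blocks, topic_terms):
    """Score each page block, then rank the links it contains by block relevance.

    `blocks` is a list of (block_text, [urls]) pairs -- the assumed output of
    some page-segmentation step, which the abstract does not specify."""
    scored = []
    for text, urls in blocks:
        score = block_relevance(text, topic_terms)
        scored.extend((score, u) for u in urls)
    return [u for _, u in sorted(scored, key=lambda pair: -pair[0])]
```

Links found inside highly topical blocks are crawled first, which is one plausible reading of how block analysis improves crawling precision.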

2.
To make full use of the rich literature resources on the Web, a specialized Web literature-collection system, WLES, was designed. The system integrates web-page fetching and web-page cleaning, and introduces machine learning into the cleaning step: a cleaning model is learned from a training corpus and then applied to clean fetched pages. Experiments show that the system performs well at both fetching and cleaning, and can satisfy users' literature-collection needs.

3.
Deep web or hidden web refers to the hidden part of the Web (usually residing in structured databases) that remains unavailable to standard Web crawlers. Obtaining content of the deep web is challenging and has been acknowledged as a significant gap in the coverage of search engines. The paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (query) to submit to the environment (the deep web database) according to Q-value. While the existing methods rely on an assumption that all deep web databases possess full-text search interfaces and solely utilize the statistics (TF or DF) of acquired data records to generate the next query, the reinforcement learning framework not only enables crawlers to learn a promising crawling strategy from their own experience, but also allows for utilizing diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in terms of crawling capability and relaxes the assumption of full-text search implied by existing methods.
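The paper's full reinforcement-learning formulation (states and Q-values over query-keyword features) is not reproduced in the abstract; the sketch below simplifies it to a bandit-style loop, assuming the reward for a query keyword is the number of new records it returned. All names are illustrative:

```python
import random

def choose_query(q_table, keywords, epsilon=0.1):
    """Epsilon-greedy selection of the next query keyword by Q-value."""
    if random.random() < epsilon:
        return random.choice(keywords)  # explore
    return max(keywords, key=lambda k: q_table.get(k, 0.0))  # exploit

def update_q(q_table, keyword, reward, alpha=0.5):
    """One-step value update; `reward` could be the count of previously
    unseen records that the submitted query returned."""
    old = q_table.get(keyword, 0.0)
    q_table[keyword] = old + alpha * (reward - old)
```

Each crawl iteration would submit the chosen keyword as a query, count the new records acquired, and feed that count back through `update_q`.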

4.
This work combines deep Web mining with topical crawling and investigates the key technologies of a deep Web vertical search engine. First, a deep Web topical crawling framework is designed: it extends the traditional topical crawling framework with a front-end classifier that executes the crawling strategy and is incrementally updated at regular intervals. Topical crawling then guides deep Web discovery, and the open-source component Lucene organizes the information gathered by the topical crawler so that the retrieval interface can serve queries; when a user submits a query, Lucene ranks the results with its default relevance algorithm. Through the design of the crawler, indexer, and query interface, a prototype vertical search engine for the deep Web was implemented.

5.
刘徽, 黄宽娜, 余建桥. 《计算机工程》 (Computer Engineering) 2012, 38(11): 284-286
The Deep Web contains rich, high-quality information resources, but because there are no static links pointing directly at Deep Web pages, most search engines cannot discover them; the pages can only be reached by filling in and submitting query forms. This paper therefore proposes a crawling strategy for a Deep Web crawler: the layered output of a page classifier guides a link-information extractor toward promising links; crawl depth is limited to three layers, with links extracted starting from those closest to the query form and only from within those three layers, which shortens crawling time and improves the crawler's accuracy; and constraint conditions for the focused crawling algorithm are designed. Experimental results show that the strategy downloads Deep Web pages effectively and improves crawling efficiency.
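The three-layer depth limit can be sketched as a plain breadth-first crawl that stops expanding links beyond depth 3 from the form page; `get_links` is a hypothetical stand-in for the fetch-and-extract step:

```python
from collections import deque

def crawl_near_form(start_url, get_links, max_depth=3):
    """Breadth-first crawl from the page holding the query form, following
    links only within `max_depth` layers of it.

    `get_links(url)` stands in for downloading a page and running the link
    extractor; it returns the URLs found on that page."""
    seen = {start_url}
    frontier = deque([(start_url, 0)])
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # pages at the depth limit are fetched but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order
```

Restricting the frontier this way is what bounds crawl time: pages more than three hops from the form never enter the queue.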

6.
This paper aims to design and develop a Distributed Virtual Geographic Environment (DVGE) system. A DVGE system is an Internet-based virtual 2D and 3D environment that provides users with a shared space and a collaborative platform for publishing multidimensional geo-data, and for simulating and analyzing complex geo-phenomena. Users logging into the system from different clients can share distributed geo-information resources, including geo-data and geo-models, and can complete collaborative tasks. Web service technology provides effective solutions for constructing DVGE systems, not only because of its ability to support multi-platform, multi-architecture, and multi-programming-language interoperability on the Internet, but also because of its ability to share programs, data, and software. This paper analyzes the characteristics, relevant technologies, and specifications of web services, such as grid services, Open Geo-data Interoperability Specifications (OpenGIS), and Geography Markup Language (GML). The architecture and working mechanisms of the DVGE system based on web services are then elaborated. To demonstrate DVGE systems based on web services, we examine a case study of water pollution in Yangzhou City, Jiangsu Province, China, using a prototype DVGE system that is developed with JBuilder 9.0 and Java3D 1.0 packages, and the Weblogic platform 8.1.

7.
Research on search strategies of web crawlers in topical search engines (Cited by: 2)
This paper analyzes in detail the search strategies of web spiders in topical search engines. Based on an in-depth analysis of the distribution of topical pages on the Web and of topic-relevance judgment algorithms, a web-spider model for topical search is proposed and its organization is described in detail. As the core of the topical spider's search strategy, the topic-relevance judgment algorithm is the key to keeping the spider's focused retrieval centered on the chosen topic. Anchor text and related link attributes are introduced into the topic-relevance judgment of URLs, and a novel URL topic-relevance algorithm, the EPR algorithm, is proposed.
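The EPR algorithm itself is not specified in the abstract; as an illustration of using anchor text in URL topic-relevance judgment, the sketch below (a hypothetical simplification, not EPR) scores a URL by the overlap of its incoming anchors' words with a topic vocabulary:

```python
def anchor_relevance(anchor_text, topic_terms):
    """Share of an anchor's words that belong to the topic vocabulary."""
    words = set(anchor_text.lower().split())
    if not words:
        return 0.0
    return len(words & topic_terms) / len(words)

def score_url(anchors, topic_terms):
    """Combine relevance over all anchors pointing at one URL.

    A real system would also weight in link attributes (position, tag
    context); here an unweighted mean keeps the sketch minimal."""
    if not anchors:
        return 0.0
    scores = [anchor_relevance(a, topic_terms) for a in anchors]
    return sum(scores) / len(scores)
```

A crawler would compute this score for every extracted URL and fetch the highest-scoring ones first.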

8.
Starting from search-engine applications, this article discusses the role and position of the web spider within a search engine and sets out its functions and design requirements. Building on an analysis of the spider's system structure and working principles, it studies strategies and algorithms for thread scheduling, page fetching, and parsing; a web spider program was implemented in Java and its results analyzed.

9.
周凤丽, 林晓丽. 《微机发展》 (Microcomputer Development) 2012, (1): 140-142, 160
The rapid development of the Internet has driven search engines forward, but as search engines have moved toward commercial operation, their technical details have become increasingly hidden. This article studies and analyzes the principles, model, and indexer of the search-engine toolkit Lucene and designs a search-engine system. The system crawls a Web site's pages non-recursively, handling the storage and processing of URL links during the crawl, and manages multiple fetching threads through multithreading to crawl pages concurrently, improving runtime efficiency. Finally, a simple news search-engine client was built with JSP; the system runs stably, serves as a workable exploration of search-engine principles, and has some practical significance.
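The non-recursive, multithreaded design described above can be sketched with a shared work queue; `fetch_links` stands in for the download-and-parse step, and termination by queue timeout is a simplification for the sketch (all names are illustrative):

```python
import queue
import threading

def crawl(seeds, fetch_links, num_threads=4):
    """Non-recursive multithreaded crawl: a shared queue holds pending URLs;
    worker threads pop a URL, 'fetch' it, and push newly discovered links."""
    pending = queue.Queue()
    seen = set(seeds)
    lock = threading.Lock()
    visited = []
    for s in seeds:
        pending.put(s)

    def worker():
        while True:
            try:
                url = pending.get(timeout=0.1)  # idle workers exit when drained
            except queue.Empty:
                return
            with lock:
                visited.append(url)
            for link in fetch_links(url):
                with lock:
                    if link in seen:
                        continue
                    seen.add(link)
                pending.put(link)
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited
```

The `seen` set guarded by a lock is what replaces recursion: each URL enters the queue at most once, so the crawl terminates on any finite site.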

10.
Implementation of processing services based on the OGC WPS standard (Cited by: 4)
Web Services provide a network-accessible solution for interoperable spatial-information processing, but the Web Services standards lack definitions for spatial-information metadata. The Open Geospatial Consortium (OGC), which works toward GIS resource sharing and processing interoperability, addressed this problem with the Web Processing Service (WPS) standard. Based on the three main WPS operations, this paper proposes an extensible WPS implementation architecture for spatial-information interoperability, and a demo WPS platform for mosaic processing was implemented on top of it. Experiments show that the architecture is sound, flexible, and extensible, and better solves the interoperability of processing functions.

11.
Topical Web retrieval, which combines gathering techniques with filtering methods, is an emerging direction in information retrieval and a research hotspot in information processing. Addressing the problems existing topical retrieval systems have in judging the topic relevance of Web page text and in spider search strategies, two performance optimizations are introduced: using information extraction, a pattern-set-based topic-relevance judgment method is proposed to improve judgment accuracy; and, to remedy PageRank's shortcomings in topical retrieval, a reinforcement-learning-based page evaluation algorithm is introduced, yielding a Web-environment-first search strategy. Finally, the performance of both algorithms is evaluated experimentally.

12.
In recent years, there has been considerable research on constructing crawlers which find resources satisfying specific conditions called predicates. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint on the internal structure of the web page. Several techniques such as focused crawling and intelligent crawling have recently been proposed for performing the topic specific resource discovery process. All these crawlers are linkage based, since they use the hyperlink behavior in order to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we will approach the problem of resource discovery from an entirely different perspective; we will mine the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate. This user behavior can be mined from the freely available traces of large public domain proxies on the world wide web. For example, proxy caches such as Squid are hierarchical proxies which make their logs publicly available. As we shall see in this paper, such traces are a rich source of information which can be mined in order to find the users that are most relevant to the topic of a given crawl. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy turns out to be extremely effective because the topical consistency in world wide web browsing patterns turns out to be very high compared to the noisy linkage information. In addition, the user-centered crawling system can be combined with linkage based systems to create an overall system which works more effectively than a system based purely on either user behavior or hyperlinks.
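As a minimal illustration of mining proxy traces for topically relevant users (the paper's actual mining procedure is richer), the sketch below ranks users by the fraction of their logged requests that satisfy the crawl predicate; the `(user, url)` pair format assumes the raw Squid log has already been parsed:

```python
from collections import Counter

def topical_users(log_entries, is_topical, min_requests=1):
    """Rank proxy-trace users by the share of their requests matching
    the crawl predicate.

    `log_entries`: iterable of (user_id, url) pairs from a parsed trace.
    `is_topical`:  the predicate applied to each requested URL."""
    total = Counter()
    hits = Counter()
    for user, url in log_entries:
        total[user] += 1
        if is_topical(url):
            hits[user] += 1
    ranked = [(hits[u] / total[u], u)
              for u in total if total[u] >= min_requests]
    return [u for _, u in sorted(ranked, reverse=True)]
```

The browsing histories of the top-ranked users would then supply candidate URLs for the crawl, complementing hyperlink-based discovery.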

13.
To satisfy users' needs for precise, personalized information access, and based on an analysis of the characteristics of Deep Web information, this paper proposes a crawler framework that can search Deep Web information across different topics. For the two hard problems in the framework, discovering Deep Web databases and choosing the Deep Web crawler's crawling strategy, it proposes, respectively, using a general-purpose search engine to accelerate the discovery of Deep Web databases across topics, and using frequently used characters as queries to download as much Deep Web information as possible. Experimental results show that the techniques adopted by the framework are feasible.

14.
15.
Key technologies for information sharing and integration based on Web Services (Cited by: 1)
Because Web Services are based on standard protocols and specifications (including HTTP, SOAP, XML, WSDL, and UDDI) and are platform- and language-independent, they have become the mainstream technology for Internet-based information exchange, sharing, integration, and interoperation, and are widely applied in e-commerce and e-government. This paper details two key technologies for Web-Service-based information sharing and integration: asynchronous invocation and dynamic invocation. Asynchronous invocation enables distributed parallel execution of Web Services and improves computational efficiency; dynamic invocation makes the overall software system extensible and easy to maintain, meeting the dynamic demands of the Internet environment.

16.

Web crawlers collect and index the vast amount of data available online to gather specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data are exponentially growing each year, the importance of web crawlers is growing as well. Web servers that currently handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. In particular, a crawler, which directs a large amount of traffic at the web server, behaves very similarly to a DDoS attack, so crawler activity tends to be blocked by the web server. A peer-to-peer (P2P) crawler can be used to solve these problems. However, the limitation of a pure P2P crawler is that it is difficult to maintain the entire system when network traffic increases or errors occur. To overcome these limitations, we propose a hybrid P2P crawler that can collect web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea's current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), with performance similar to the SC-DWC. However, on both Portals B and C, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in both collection rate and the amount of data collected in the same time. It was also confirmed that the hybrid P2P networking system can work efficiently in web crawler architectures.


17.
18.
A survey of focused crawler technology (Cited by: 51)
周立柱, 林玲. 《计算机应用》 (Journal of Computer Applications) 2005, 25(9): 1965-1969
The rapid growth of the Internet poses an enormous challenge to finding and discovering information on the World Wide Web. For the topic- or domain-specific queries most users pose, traditional general-purpose search engines often fail to return satisfactory result pages. To overcome this shortcoming, research on topic-oriented focused crawlers was proposed, and focused crawling has since become one of the research hotspots concerning the Web. This paper surveys the field: it gives the basic concept of the focused crawler and outlines its working principles; and, following the current state of research, it systematically introduces and analyzes the key techniques of focused crawling (crawl-target description, page-analysis algorithms, and page-search strategies). On this basis, it suggests future research directions for focused crawling, including crawler techniques oriented toward data analysis and mining, topic description and definition, discovery of related resources, Web data cleaning, and expansion of the search space.

19.
Efficiency modeling and evaluation of distributed search engine systems (Cited by: 1)
张伟哲, 张宏莉, 许笑, 何慧. 《软件学报》 (Journal of Software) 2012, 23(2): 253-265
Addressing the modeling and evaluation of distributed search-engine system efficiency, this paper models and classifies current distributed search-engine systems and extends the cost model with energy consumption and network overhead. Five design schemes for building search-engine systems are analyzed and evaluated in detail from the perspectives of system cost, system scale, and query response time. The analysis shows that a semi-wide-area search-engine system, composed of a wide-area distributed gathering system and a multi-cluster indexing system, achieves relatively high efficiency compared with the other systems while maintaining good quality of service for users.

20.
An Agent-based integration scheme for mobile Web services (Cited by: 1)
茹蓓, 肖云鹏, 张俊鹏. 《计算机工程》 (Computer Engineering) 2012, 38(9): 49-50, 54
Combining the Aglets platform with J2EE servlet technology, this paper proposes an Agent-based three-layer integration scheme for J2ME mobile Web services. At the terminal layer, a lightweight proxy access mode reduces the load on resource-constrained mobile devices. At the Web access layer, standard Web-service access ensures uniform access for heterogeneous mobile platforms. At the mobile Agent layer, multiple cooperating Agents keep the system efficient and flexible. On this basis, a mobile purchase price-comparison system was designed and implemented. Application results show that the scheme improves the efficiency and robustness with which J2ME devices discover and access Web services in wireless environments.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号