首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
为提高主题爬虫的性能,依据站点信息组织的特点和URL的特征,提出一种基于URL模式集的主题爬虫。爬虫分两阶段,在实验爬虫阶段,采集站点样本数据,采用基于URL前缀树的模式构建算法构建URL模式,形成模式关系图,并利用HITS算法分析该模式关系图,计算出各模式的重要度;在聚焦爬虫阶段,无需预先下载页面,即可利用生成的URL模式判断页面是否主题相关和能否指导爬虫深入抓取,并根据URL模式的重要度预测待抓取链接优先级。实验表明,该爬虫相比现有的主题爬虫能快速引导爬虫抓取主题相关页面,保证爬虫的查准率和查全率,有效提高爬虫抓取效率。  相似文献   

2.
基于遗传算法的主题爬行技术研究   总被引:3,自引:0,他引:3  
针对目前主题搜索策略的不足,提出了基于遗传箅法的主题爬行策略,提高了链接于内容相似度不高的网页之后的页面被搜索的机会,扩大了相关网页的搜索范围.同时,在网页相关度分析方面,引入了基于本体语义的主题过滤策略.实验结果表明,基于遗传算法的主题爬虫抓取网页中的主题相关网页数量多,在合理选择种子集合时,能够抓取大量的主题相关度高的网页.  相似文献   

3.
针对传统通用网络爬虫的自身固有的缺陷,结合本体的相关理论,提出了一种基于语义本体的网络爬虫的相关模型。该模型以本体构建领域知识概念集,结合知网,从语义的角度,利用扩展的元数据,在词的语义层次,对抓取的页面链接进行语义相关性计算,预测与主题相关的URL,提高采集的网络资源信息与设定主题的相关度。实验结果表明,该模型同其它通用网络爬虫模型相比具有较高的信息抓取准确率。  相似文献   

4.
无论是通用搜索还是垂直搜索,其关键的核心技术之一就是网络爬虫的设计.本文结合HTMLParser信息提取方法,对生活类垂直搜索引擎中网络爬虫进行了详细研究.通过深入分析生活类网站网址的树形结构的构架,开发了收集种子页面URL的模拟搜索器,并基于HTMLParser的信息提取方法,从种子页面中提取出与生活类主题相关的目标URL.经实验测试证明该爬虫的爬准率达93.552%,爬全率达96.720%,表明该网络爬虫是有效的,达到中等规模的垂直搜索企业级应用的要求.  相似文献   

5.
互联网网页所形成的主题孤岛严重影响了搜索引擎系统的主题爬虫性能,通过人工增加大量的初始种子链接来发现新主题的方法无法保证主题网页的全面性.在分析传统基于内容分析、基于链接分析和基于语境图的主题爬行策略的基础上,提出了一种基于动态隧道技术的主题爬虫爬行策略.该策略结合页面主题相关度计算和URL链接相关度预测的方法确定主题孤岛之间的网页页面主题相关性,并构建层次化的主题判断模型来解决主题孤岛之间的弱链接问题.同时,该策略能有效防止主题爬虫因采集过多的主题无关页面而导致的主题漂移现象,从而可以实现在保持主题语义信息的爬行方向上的动态隧道控制.实验过程利用主题网页层次结构检测页面主题相关性并抽取“体育”主题关键词,然后以此对采集的主题网页进行索引查询测试.结果表明,基于动态隧道技术的爬行策略能够较好的解决主题孤岛问题,明显提升了“体育”主题搜索引擎的准确率和召回率.  相似文献   

6.
为了解决传统主题爬虫效率偏低的问题,传统主题爬虫会选择最有价值的链接进行访问,仅简单地计算链接的相关性,却忽视待分析URL之间的相关性关系,致使主题爬虫爬取效率较低。提出一种基于链接模型的相关性判别算法,综合利用有标种子URL和无标的待判别URL实现对无标URL的相关性判别,并推导出迭代初值选取对结果的不敏感性。实验结果表明,与传统的网络爬虫算法相关性判别方法相比,提出的方法效率更高。  相似文献   

7.
如今,互联网集成的与暴雨灾害相关的信息多种多样,然而人工搜索网页信息的效率不高,因此网络主题爬虫显得十分重要。在通用网络爬虫的基础上,为提高主题相关度的计算精度并预防主题漂移,通过对链接锚文本主题相关度、链接所在网页的主题相关度、链接指向网页PR值和该网页主题相关度的综合计算,提出了基于网页内容和链接结构相结合的超链接综合优先度评估方法。同时,针对搜索过程易陷入局部最优的不足,首次设计了结合爬虫记忆历史主机信息和模拟退火的网络主题爬虫算法。以暴雨灾害为主题进行爬虫实验的结果表明,在爬取相同网页数的情况下,相比于广度优先搜索策略(Breadth First Search,BFS)和最佳优先搜索策略(Optimal Priority Search,OPS),所提出的算法能抓取到更多与主题相关的网页,爬虫算法的准确率得到明显提升。  相似文献   

8.
基于遗传算法的聚焦爬虫搜索策略   总被引:1,自引:0,他引:1       下载免费PDF全文
曾广朴  范会联 《计算机工程》2010,36(11):167-169
为了提高聚焦爬虫的搜索效率,提出一种结合内容评价和链接结构搜索策略的优点并利用小生境遗传算法进行全局寻优的搜索策略。改进遗传算子和小生境遗传算法,将待搜索的网页URL作为遗传个体,采用概率变迁规则和小生境淘汰运算引导搜索方向。实验结果证明,与聚焦爬虫的其他实现技术相比,该策略在抓取主题相关网页时具有更高的查准率和查全率。  相似文献   

9.
本文提出以爬行控制器和页面分析过滤器为核心的聚焦爬虫设计方法。从待检索主题出发,在以改进的遗传算法为基础并结合内容评价和链接结构搜索策略优点的爬行策略引导下,以待爬行URL作为遗传个体,基于主题词集的向量空间模型评估个体适应度,引入新的URL实现交叉、变异操作,将具有相同URL前缀的链接按小生境处理。实践证明,该爬虫具有较好的性能。  相似文献   

10.
主题爬虫的目的在于尽可能准确地获取与特定主题相关的内容。针对主题爬虫主题覆盖率不足和主题相似度计算准确度偏低,提出一种动态主题的主题爬虫框架,对主题关键词进行两重扩展:用同主题的词扩展和词的语义扩展。利用主题爬虫自身主题相关资源收集的功能,不断对语料进行扩充,通过LDA训练得到主题文档来进行主题词库扩展更新。在此基础上,提出一种基于word2vec词向量表示的改进相似度计算模型,用于页面相似度计算和URL优先级排序。通过在真实新闻数据集上的实验表明,提出的爬虫在主题相关度的判断准确度和主题内容收获率上均有较好表现。  相似文献   

11.
Abstract This paper describes an approach to the design of interactive multimedia materials being developed in a European Community project. The developmental process is seen as a dialogue between technologists and teachers. This dialogue is often problematic because of the differences in training, experience and culture between them. Conditions needed for fruitful dialogue are described and the generic model for learning design used in the project is explained.  相似文献   

12.
European Community policy and the market   总被引:1,自引:0,他引:1  
Abstract This paper starts with some reflections on the policy considerations and priorities which are shaping European Commission (EC) research programmes. Then it attempts to position the current projects which seek to capitalise on information and communications technologies for learning in relation to these priorities and the apparent realities of the marketplace. It concludes that while there are grounds to be optimistic about the contribution EC programmes can make to the efficiency and standard of education and training, they are still too technology driven.  相似文献   

13.
融合集成方法已经广泛应用在模式识别领域,然而一些基分类器实时性能稳定性较差,导致多分类器融合性能差,针对上述问题本文提出了一种新的基于多分类器的子融合集成分类器系统。该方法考虑在度量层融合层次之上通过对各类基多分类器进行动态选择,票数最多的类别作为融合系统中对特征向量识别的类别,构成一种新的自适应子融合集成分类器方法。实验表明,该方法比传统的分类器以及分类融合方法识别准确率明显更高,具有更好的鲁棒性。  相似文献   

14.
Development of software intensive systems (systems) in practice involves a series of self-contained phases for the lifecycle of a system. Semantic and temporal gaps, which occur among phases and among developer disciplines within and across phases, hinder the ongoing development of a system because of the interdependencies among phases and among disciplines. Such gaps are magnified among systems that are developed at different times by different development teams, which may limit reuse of artifacts of systems development and interoperability among the systems. This article discusses such gaps and a systems development process for avoiding them.  相似文献   

15.
This paper presents control charts models and the necessary simulation software for the location of economic values of the control parameters. The simulation program is written in FORTRAN, requires only 10K of main storage, and can run on most mini and micro computers. Two models are presented - one describes the process when it is operating at full capacity and the other when the process is operating under capacity. The models allow the product quality to deteriorate to a further level before an existing out-of-control state is detected, and they can also be used in situations where no prior knowledge exists of the out-of-control causes and the resulting proportion defectives.  相似文献   

16.
为了设计一种具有低成本、低功耗、易操作、功能强且可靠性高的煤矿井下安全分站,针对煤矿安全生产实际,文章提出了采用MCS-51系列单片机为核心、具有CAN总线通信接口的煤矿井下安全监控分站的设计方案;首先给出煤矿井下安全监控分站的整体构架设计,然后着重阐述模拟量输入信号处理系统的设计过程,最后说明单片机最小系统及其键盘、显示、报警、通信等各个组成部分的设计;为验证设计方案的可行性与有效性,使用Proteus软件对设计内容进行仿真验证,设计的煤矿井下安全监控分站具有瓦斯、温度等模拟量参数超标报警功能和电机开停、风门开闭等开关量指示功能;仿真结果表明:设计的煤矿井下安全监控分站具有一定的实际应用价值.  相似文献   

17.
Going through a few examples of robot artists who are recognized worldwide, we try to analyze the deepest meaning of what is called “robot art” and the related art field definition. We also try to highlight its well-marked borders, such as kinetic sculptures, kinetic art, cyber art, and cyberpunk. A brief excursion into the importance of the context, the message, and its semiotics is also provided, case by case, together with a few hints on the history of this discipline in the light of an artistic perspective. Therefore, the aim of this article is to try to summarize the main characteristics that might classify robot art as a unique and innovative discipline, and to track down some of the principles by which a robotic artifact can or cannot be considered an art piece in terms of social, cultural, and strictly artistic interest. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

18.
Although there are many arguments that logic is an appropriate tool for artificial intelligence, there has been a perceived problem with the monotonicity of classical logic. This paper elaborates on the idea that reasoning should be viewed as theory formation where logic tells us the consequences of our assumptions. The two activities of predicting what is expected to be true and explaining observations are considered in a simple theory formation framework. Properties of each activity are discussed, along with a number of proposals as to what should be predicted or accepted as reasonable explanations. An architecture is proposed to combine explanation and prediction into one coherent framework. Algorithms used to implement the system as well as examples from a running implementation are given.  相似文献   

19.
This paper provides the author's personal views and perspectives on software process improvement. Starting with his first work on technology assessment in IBM over 20 years ago, Watts Humphrey describes the process improvement work he has been directly involved in. This includes the development of the early process assessment methods, the original design of the CMM, and the introduction of the Personal Software Process (PSP)SM and Team Software Process (TSP){SM}. In addition to describing the original motivation for this work, the author also reviews many of the problems he and his associates encountered and why they solved them the way they did. He also comments on the outstanding issues and likely directions for future work. Finally, this work has built on the experiences and contributions of many people. Mr. Humphrey only describes work that he was personally involved in and he names many of the key contributors. However, so many people have been involved in this work that a full list of the important participants would be impractical.  相似文献   

20.
基于复小波噪声方差显著修正的SAR图像去噪   总被引:4,自引:1,他引:3  
提出了一种基于复小波域统计建模与噪声方差估计显著性修正相结合的合成孔径雷达(Synthetic Aperture Radar,SAR)图像斑点噪声滤波方法。该方法首先通过对数变换将乘性噪声模型转化为加性噪声模型,然后对变换后的图像进行双树复小波变换(Dualtree Complex Wavelet Transform,DCWT),并对复数小波系数的统计分布进行建模。在此先验分布的基础上,通过运用贝叶斯估计方法从含噪系数中恢复原始系数,达到滤除噪声的目的。实验结果表明该方法在去除噪声的同时保留了图像的细节信息,取得了很好的降噪效果。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号