Similar documents
 20 similar documents found (search time: 15 ms)
1.
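To address the lack of generality in web page body-text extraction algorithms, and the fact that extraction from news pages typically omits the title, time, and source, this paper proposes newsExtractor, an algorithm for extracting key news information. The algorithm first preprocesses a web page into a set of line numbers and text. Then, exploiting the observation that the sentence with the most characters is very likely to appear in the news body, it searches from the middle of the body toward both ends to find the start and end of the body and extracts it. The title is extracted using the longest common substring algorithm; the time is extracted with constructed regular expressions, with line numbers as an auxiliary cue; and the source is extracted from its formatting characteristics, again aided by line numbers. Finally, a dataset was constructed and used to compare extraction accuracy against the foreign open-source software newsPaper. The experimental results show that newsExtractor outperforms newsPaper in average extraction accuracy for body, title, time, and source, and that it is general and robust.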
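
The body and title steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the newsExtractor implementation: the function names, the `min_len` and `max_gap` thresholds, and the choice to compare the page's `<title>` text against the lines above the body are all assumptions made for the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def _grow(lines, anchor, step, min_len, max_gap):
    """Walk from the anchor in one direction; stop after more than max_gap short lines in a row."""
    edge, gap, i = anchor, 0, anchor + step
    while 0 <= i < len(lines) and gap <= max_gap:
        if len(lines[i]) >= min_len:
            edge, gap = i, 0
        else:
            gap += 1
        i += step
    return edge


def extract_body(lines, min_len=40, max_gap=1):
    """Return (start, end) line indices of the body, grown outward from the longest line."""
    # Assumption: the longest line lies inside the news body.
    anchor = max(range(len(lines)), key=lambda i: len(lines[i]))
    start = _grow(lines, anchor, -1, min_len, max_gap)
    end = _grow(lines, anchor, +1, min_len, max_gap)
    return start, end


def extract_title(page_title, lines, body_start):
    """Hypothetical title step: longest substring shared by the <title> text and a line above the body."""
    candidates = [longest_common_substring(page_title, ln) for ln in lines[:body_start]]
    return max(candidates, key=len, default="").strip()
```

Growing outward from the longest line means the sketch never needs HTML tags to locate the body, which is the property the abstract credits for generality; the thresholds would need tuning on real pages.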

2.
刘伟  严华梁 《计算机工程》2012,38(11):167-169
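A unified method for automatically extracting Web news objects is proposed. Taking the category, title, publication time, source, author, content, related-comment links, and related-news links of a news page as classification attributes, the method extracts news objects automatically in three steps: page parsing, candidate-value extraction, and true-value identification. Experimental results show that the method achieves high accuracy when extracting multiple attributes of a news object simultaneously, and that the extraction results do not depend on any particular page template.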

3.
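Dynamic pages account for more than 50% of all pages on the Internet in China, yet the information-collection stage of current online public opinion monitoring and control cannot acquire information from Internet media that publish mainly through dynamic pages. In view of this, the paper proposes an overall scheme for parsing JavaScript dynamic pages based on Rhino. Experimental results show that the scheme substantially enriches the data sources available to Internet public opinion monitoring and control, and that it is an effective solution for recursively obtaining hyperlink URLs within dynamic pages and for extracting the main content of web pages.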

4.
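Research and implementation of JavaScript dynamic page parsing based on Rhino   Cited by: 1 (self-citations: 0, citations by others: 1)
Dynamic pages account for more than 50% of all pages on the Internet in China, yet the information-collection stage of current online public opinion monitoring and control cannot acquire information from Internet media that publish mainly through dynamic pages. In view of this, the paper proposes an overall scheme for parsing JavaScript dynamic pages based on Rhino. Experimental results show that the scheme substantially enriches the data sources available to Internet public opinion monitoring and control, and that it is an effective solution for recursively obtaining hyperlink URLs within dynamic pages and for extracting the main content of web pages.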

5.
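Current mainstream web page extraction methods have two major problems: they extract only a single type of information, making it hard to obtain multiple kinds of news information; and most rely on HTML tags, making them hard to extend to different sources. To address this, a news web page information extraction method based on multi-dimensional text features is proposed. It uses the writing characteristics of news text to derive writing, semantic, and positional features, and fuses them through a multi-channel convolutional neural network into multi-dimensional text features used to extract multiple kinds of news page information. Only a small training dataset is needed to extract information from news pages of new sources. Experimental results show that the method outperforms the current best method.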

6.
郭孝园  何臻 《工矿自动化》2012,38(8):100-104
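To solve the problem that users of coal mine enterprise websites have difficulty finding information, a personalized recommendation service model for coal mine enterprise websites based on Web logs is proposed. The model applies association rules to recommend pages to new users and a clustering algorithm to recommend pages to returning users. Combined with a method that measures user interest from the number of page clicks, page browsing time, the Jaccard coefficient (雅可系数), and the longest common path coefficient, it can accurately recommend pages of interest to users. Test results show that the model can effectively classify web page resources and make personalized recommendations.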

7.
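Understanding users' interests is an important task for e-commerce websites. This paper proposes a new method for analyzing user interest. The method jointly considers the time a user spends browsing a page, the number of clicks on the page, and the page's type and content. It first groups the site's content into several topics, then uses fuzzy comprehensive evaluation to obtain the user's degree of interest in a given topic, and compares this with the average interest level across the site's topics to identify where the user's interests lie. Experiments show that the method is effective.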
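
As a rough illustration of the fuzzy comprehensive evaluation step, the sketch below combines a weight vector over the four factors named in the abstract with a membership matrix over interest grades, using the common weighted-average composition operator. The abstract does not state which operator, weights, or grades the authors used, so every number and name here is a hypothetical placeholder.

```python
def fuzzy_evaluate(weights, membership):
    """Weighted-average composition: B[j] = sum_i weights[i] * membership[i][j]."""
    grades = len(membership[0])
    return [
        sum(w * row[j] for w, row in zip(weights, membership))
        for j in range(grades)
    ]


def interest_score(evaluation, grade_values):
    """Collapse the evaluation vector into one score by weighting each grade's value."""
    total = sum(evaluation)
    return sum(b * v for b, v in zip(evaluation, grade_values)) / total


# Hypothetical weights for: browsing time, click count, page type, page content.
weights = [0.4, 0.3, 0.2, 0.1]

# Hypothetical membership of each factor in the grades (high, medium, low interest)
# for one user and one topic.
membership = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]

evaluation = fuzzy_evaluate(weights, membership)
score = interest_score(evaluation, grade_values=[1.0, 0.5, 0.0])

# Hypothetical site-wide average for this topic; a user scoring above it
# would be flagged as interested in the topic.
site_average = 0.5
is_interested = score > site_average
```

Comparing the per-topic score with a site-wide average, as the abstract describes, avoids fixing an absolute threshold that would favor topics with naturally long or frequently clicked pages.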

8.
Weblogs: simplifying web publishing   Cited by: 1 (self-citations: 0, citations by others: 1)
Lindahl  C. Blount  E. 《Computer》2003,36(11):114-116
A weblog - blog, for short - is a Web site that uses a dated log format to publish periodical information. The updates are frequent, usually daily, according to the site owner's editorial purposes - or whims. Blogs contribute to Web content by linking and filtering evolving content in a structured way and by establishing interlinked communities - the blogosphere - connecting people through shared interests. Bloggers can link to news feeds, personal journals, and topic-specific blogs of almost every sort. Blogging systems are emerging tools that make it easier to set up a blog; to update, distribute, and archive its information; and to enhance its functionality.

9.
熊忠阳  蔺显强  张玉芳  牙漫 《计算机工程》2013,(12):200-203,210
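Web pages contain body text as well as information unrelated to it, and the unrelated information negatively affects Web page classification, storage, and retrieval. To reduce this effect, a body-text extraction method combining a page's structural features and text features is proposed. Irrelevant elements are first removed with regular expressions as an initial filter. The page is then split linearly into blocks according to its structural features, each block is classified as a link block or a text block according to its text features, and the fact that noise blocks tend to occur consecutively is used to locate the body, yielding the page's body text. Experimental results show that the method extracts the body content of web pages quickly and accurately.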
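
The pipeline in this abstract (regex filtering, linear blocking, link-block versus text-block classification, body location) might look roughly like the sketch below. It is a simplification under stated assumptions: one source line is treated as one block, the `link_ratio` threshold of 0.5 is arbitrary, and the body is taken to be the largest run of text blocks between link blocks, which is only one possible reading of how consecutive noise blocks are used.

```python
import re

# Initial filter: drop <script>/<style> elements and HTML comments.
NOISE_RE = re.compile(r"<(script|style)\b.*?</\1>|<!--.*?-->", re.S | re.I)
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def classify_blocks(html, link_ratio=0.5):
    """Split the filtered page into line blocks and label each 'link' or 'text'."""
    cleaned = NOISE_RE.sub("", html)
    blocks = []
    for line in cleaned.splitlines():
        text = TAG_RE.sub("", line).strip()
        if not text:
            continue  # the line held only markup
        link_text = "".join(TAG_RE.sub("", a) for a in ANCHOR_RE.findall(line))
        kind = "link" if len(link_text) / len(text) > link_ratio else "text"
        blocks.append((kind, text))
    return blocks


def extract_main_text(html):
    """Return the largest run of consecutive text blocks, joined by newlines."""
    best, run = [], []
    for kind, text in classify_blocks(html) + [("link", "")]:  # sentinel flushes the last run
        if kind == "text":
            run.append(text)
        else:
            if sum(map(len, run)) > sum(map(len, best)):
                best = run
            run = []
    return "\n".join(best)
```

A regex-based filter is fragile on malformed HTML; it is used here only because the abstract names regular expressions as the first filtering step.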

10.
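Research on Web text data mining   Cited by: 8 (self-citations: 0, citations by others: 8)
The World Wide Web is a huge, widely distributed, global information service center covering news, advertising, consumer information, financial management, education, government, e-commerce, and many other information services. Web text mining systems are an important application of mining technology; Web text mining here refers to the process of automatically determining the category of a web page from its content under a given classification scheme.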

11.
The significance of modeling and measuring various attributes of the Web in part or as a whole is undeniable. Modeling information phenomena on the Web constitutes fundamental research towards an understanding that will contribute to the goal of increasing its utility. Although Web related metrics have become increasingly sophisticated, few employ models to explain their measurements. In this paper, we discuss issues related to metrics for Web page significance. These metrics are used for ranking the quality and relevance of Web pages in response to user needs. We focus on the problem of ascertaining the statistical distribution of some well-known hyperlink-based Web page quality metrics. Based on empirical distributions of Web page degrees, we derived analytically the probability distribution for the PageRank metric. We found that it follows the familiar inverse polynomial law reported for Web page degrees. We verified the theoretical exercise with experimental results that suggest a highly concentrated distribution of the metric.

12.
To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web intrapage informative structure mining based on the document object model) which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is built by expanding the set using proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.

13.
Cellular phones are widely used to access the Web. However, most available Web pages are designed for desktop PCs, and it is inconvenient to browse these large Web pages on a cellular phone with a small screen and poor interfaces. Users who browse a Web page on a cellular phone have to scroll through the whole page to find the desired content, and must then search and scroll within that content in detail to get useful information. This paper describes the design and implementation of a novel Web browsing system for cellular phones. This system includes a Web page overview to reduce scrolling operations when finding objective content within the page. Furthermore, it adaptively presents content according to its characteristics to reduce burdensome operations when searching within content.

14.
In this paper we outline the use of term rewriting techniques for modeling the dynamic behavior of Web sites. We associate rewrite rules with each Web page, expressing the Web pages that are immediately reachable from this page. The resulting system permits the application of well-known results from rewriting theory to analyse interesting properties of the Web site. In particular, we briefly discuss the use of some logics with strong connections with term rewriting as a basis for specifying and verifying dynamic properties of Web sites. We use Maude as a suitable specification language for such rewriting models, which also permits direct exploration of interesting dynamic properties of Web sites.
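
To make the idea of "one rewrite rule per immediately reachable page" concrete, the sketch below encodes a small, entirely hypothetical site as rules of the form page -> successor pages and computes reachability as the reflexive-transitive closure of one-step rewriting. The paper itself uses Maude; Python is substituted here purely for illustration, and the page names are invented.

```python
from collections import deque


def reachable(rules, start):
    """Pages obtainable from start by zero or more rewrite steps (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for successor in rules.get(page, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def dead_ends(rules):
    """Pages with no outgoing rule, i.e. normal forms of the rewriting system."""
    return {page for page, successors in rules.items() if not successors}


# Hypothetical site: each entry reads "page rewrites to any of these pages".
site_rules = {
    "home": ["news", "about"],
    "news": ["article", "home"],
    "article": ["news"],
    "about": [],
    "orphan": ["home"],
}

from_home = reachable(site_rules, "home")
unreachable_from_home = set(site_rules) - from_home
```

Properties such as "every page is reachable from the home page" or "no page is a dead end" then become simple checks on these sets, which is the kind of dynamic property the abstract proposes to verify.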

15.
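With traditional Web technology, it is difficult, after a user submits data, to display synchronized progress information in the web page while the server is processing the data; the problem is especially pronounced when the number of partial-content updates to the page is not known in advance. This paper uses recursive function calls to obtain system information asynchronously and extends the capabilities of the Web system to improve the expressiveness of human-computer interaction, presenting the processing procedure to the user visually and effectively solving the problem. Experimental results show that, whereas traditionally one Web thread can serve only one HTTP request, the approach studied here allows multiple Web threads to serve a single request simultaneously.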

16.
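Building on virtual web page technology and drawing on the ideas of modular programming, a modular design method for Web pages is proposed. Combining virtual web page technology with modularization can markedly change how information is organized and stored, and supports module-level reuse of page designs, rapid recombination, extension, and updating.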

17.
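When browsing the Web, static pages are noticeably faster to access than dynamic pages, so it is important to use static-page techniques to turn key or frequently visited pages into static pages. After introducing static page generation technology, the paper describes static page generation in detail using two examples: publishing news, and designing the news item section of a home page. It mainly uses file objects to generate and read files, and the technology used is ASP.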

18.
With the rapid changes in dynamic web pages, there is an increasing need for receiving instant updates for dynamic blocks on the Web. In this paper, we address the problem of automatically following dynamic blocks in web pages. Given a user-specified block on a web page, we continuously track the content of the block and report the updates in real time. This service can bring obvious benefits to users, such as the ability to track top-ten breaking news on CNN, the prices of iPhones on Amazon, or NBA game scores. We study 3,346 human labeled blocks from 1,127 pages, and analyze the effectiveness of four types of patterns, namely visual area, DOM tree path, inner content and close context, for tracking content blocks. Because of frequent web page changes, we find that the initial patterns generated on the original page could be invalidated over time, leading to the failure of extracting correct blocks. According to our observations, we combine different patterns to improve the accuracy and stability of block extractions. Moreover, we propose an adaptive model that adapts each pattern individually and adjusts pattern weights for an improved combination. The experimental results show that the proposed models outperform existing approaches, with the adaptive model performing the best.

19.
Providing an effective mobile search service is a difficult task given the unique characteristics of the mobile space. Small-screen devices with limited input and interaction capabilities do not make ideal search devices. In addition, mobile content, by its concise nature, offers limited indexing opportunities, which makes it difficult to build high-quality mobile search engines and indexes. In this paper we consider the issue of limited page content by evaluating a heuristic content enrichment framework that uses standard Web resources as a source of additional indexing knowledge. We present an evaluation using a mobile news service that demonstrates significant improvements in search performance compared to a benchmark mobile search engine.

20.
王冠  裘正定 《微机发展》2005,15(3):136-138,141
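The AIP (All day Information Pursue) platform, i.e. an all-day information tracking platform, is a solution that lets enterprises or groups following news on many fronts view new information on the Internet, compensating for some shortcomings of search engines. It retrieves each day's new information from the Internet and uses automatic web page classification to remove irrelevant articles. Through the platform, users can view information by time or by category, and can also annotate articles and recommend them to others.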
