Found 20 similar documents; search took 671 ms.
1.
DOM-based extraction of topic information from Web pages. Total citations: 1 (self-citations: 0, citations by others: 1)
With the growth of the Internet, the volume and density of information on Web pages keep increasing, yet a page's topic information is usually not clearly delimited and is therefore hard to extract. To address this problem, an algorithm is proposed: build a Document Object Model (DOM) tree; then, to compensate for HTML's weak, semi-structured nature, annotate the DOM with display and semantic attributes (link count, non-link character count, height, width); apply a clustering rule to partition the tree into blocks; and finally prune away useless nodes, leaving the topic information. Experiments show that the method extracts topic information accurately.
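The entry above does not quote its exact attribute set or pruning rule, so the following is only an illustrative sketch: it builds a small DOM-like tree, annotates every node with two of the named attributes (link count and non-link text length), and keeps blocks whose non-link text exceeds an assumed threshold.

```python
# Sketch of DOM annotation for topic-block extraction. The attributes kept
# (link_count, nonlink_text) and the threshold of 20 characters are
# assumptions; the paper's height/width attributes and clustering rule are
# not reproduced. Void elements (<br>, <img>) are not handled.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.link_count = 0    # number of <a> elements in this subtree
        self.nonlink_text = 0  # characters of text outside any link

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
        if tag == "a":
            self.in_link += 1
            n = node               # propagate link count to all ancestors
            while n:
                n.link_count += 1
                n = n.parent

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link -= 1
        if self.cur.parent:
            self.cur = self.cur.parent

    def handle_data(self, data):
        if not self.in_link:
            n = self.cur           # non-link text counts toward all ancestors
            while n:
                n.nonlink_text += len(data.strip())
                n = n.parent

html = ('<html><body>'
        '<div><a href="/a">Home</a><a href="/b">News</a><a href="/c">About</a></div>'
        '<div>Web pages carry ever more information, but the topic text is '
        'hard to locate automatically.</div>'
        '</body></html>')
b = DomBuilder()
b.feed(html)
body = b.root.children[0].children[0]
# keep blocks with enough non-link text (assumed threshold)
content = [c for c in body.children if c.nonlink_text >= 20]
```

Here the navigation `div` (all of its text inside links) is pruned and only the text-bearing `div` survives as a topic block.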
2.
For the data-intensive dynamic pages of the Web, which contain little free text but are highly structured, this entry introduces a Web information extraction method based on HTML structure. The method first parses each Web page after noise removal, then computes pairwise page similarity from tree edit distance and clusters the pages, and finally generates an extraction rule for each cluster and uses it to extract data from the pages.
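The entry's similarity measure is tree edit distance; as a compact, runnable stand-in, the sketch below serializes each page's tag structure into a pre-order tag sequence and measures similarity with a sequence matcher, then clusters greedily. The 0.8 threshold and the sequence approximation are assumptions, not the paper's method.

```python
# Approximating tree-edit-distance page similarity with a tag-sequence
# matcher (an assumption standing in for a true tree edit distance such as
# Zhang-Shasha), followed by greedy single-pass clustering.
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)      # pre-order sequence of element tags

def tag_seq(html):
    p = TagSequence()
    p.feed(html)
    return p.tags

def similarity(a, b):
    return SequenceMatcher(None, tag_seq(a), tag_seq(b)).ratio()

def cluster(pages, threshold=0.8):
    """Join a page to the first cluster whose representative (first member)
    is similar enough; otherwise start a new cluster."""
    clusters = []
    for page in pages:
        for c in clusters:
            if similarity(c[0], page) >= threshold:
                c.append(page)
                break
        else:
            clusters.append([page])
    return clusters

p1 = "<html><body><h1>Item A</h1><table><tr><td>price</td></tr></table></body></html>"
p2 = "<html><body><h1>Item B</h1><table><tr><td>price</td></tr></table></body></html>"
p3 = "<html><body><p>blog text</p><p>more</p><p>more</p></body></html>"
groups = cluster([p1, p2, p3])
```

The two template-identical product pages land in one cluster and the blog page in another; each cluster would then get its own extraction rule.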
3.
4.
To obtain information hidden in scattered Web pages, a Web information extraction system was designed. The system first gathers pages with an improved HITS topic-selection algorithm; it then preprocesses the HTML structure of each document; finally, an absolute-XPath generation algorithm over the DOM tree produces an XPath expression for each annotated node, and extraction rules written in the XPath language combined with XSLT yield a structured database or XML files, thereby locating and extracting Web information. An extraction experiment on a shopping site shows that the system performs well and can batch-extract similar Web pages.
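Only the absolute-XPath step of the entry above lends itself to a short sketch; the HITS crawler and XSLT stages are not reproduced. The sample document and tag names are illustrative assumptions.

```python
# Generating an absolute XPath (/tag[pos]/... from the root) for a target
# node by walking up the tree; the generated path is then re-evaluated with
# ElementTree's limited XPath support to confirm it locates the same node.
import xml.etree.ElementTree as ET

def absolute_xpath(root, target):
    parents = {c: p for p in root.iter() for c in p}   # child -> parent map
    parts = []
    node = target
    while node is not root:
        parent = parents[node]
        siblings = [c for c in parent if c.tag == node.tag]
        parts.append(f"{node.tag}[{siblings.index(node) + 1}]")
        node = parent
    parts.append(root.tag)
    return "/" + "/".join(reversed(parts))

doc = ET.fromstring(
    "<html><body><div><span>name</span><span>price</span></div></body></html>")
price_node = doc.find(".//div")[1]
path = absolute_xpath(doc, price_node)   # "/html/body[1]/div[1]/span[2]"
```

Stripping the leading `/html` gives a path ElementTree itself can evaluate, which is how an extraction rule built from the expression would be applied to sibling pages of the same template.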
5.
Ever since Duke, the little Java mascot, started bouncing around on Web pages, the World Wide Web has no longer been lifeless: Java became an overnight worldwide sensation and a hot topic for every kind of media. In practice, Java extends a Web page or Web browser by downloading, on demand, small programs called applets. But an applet exists outside the page's own HTML file; it cannot interact with HTML document types, forms, or window objects. Like an image assigned a special rectangle, a Java applet is simply allocated a region of the current page and then does its own work inside that space. So, in the ability to interact with the user and the browser environment, Java has…
6.
Advances in Web database technology. Total citations: 8 (self-citations: 0, citations by others: 8)
1. Introduction — The WWW is the fastest-growing part of the Internet today and its most important means of information retrieval. Early Web pages (home pages) were mainly used to deliver static HTML documents; later, the CGI interface, and especially the introduction of the Java and JavaScript languages, made it easy for Web pages to carry dynamic information. With Java and JavaScript one can design Web pages with animation, sound, graphics and images, and various special effects. The main components of the WWW are the Hypertext Transfer Protocol (HTTP), the Hypertext Markup Language (HTML), the Common Gateway Interface (CGI), and the Java and JavaScript languages. HTTP (Hypertext Transfer Protocol) is a network protocol designed specifically for exchanging data between Web servers and Web browsers. Through Uniform Resource Locators (URLs) it lets a client browser establish links to a server's Web resources, thereby laying…
7.
张佳荣 《电子制作.电脑维护与应用》2015,(8)
HtmlUnit is an open-source, Java-based browser-emulation project. Studying it is an effective way to understand in depth how Web pages and browsers work: the project builds an abstract model of HTML documents and provides a programming interface for fetching pages, filling in forms, and following hyperlinks, letting a program interact with a Web site just like an ordinary browser. The project's official site is http://htmlunit.sourceforge.net. This entry mainly describes using the Java language to call the classes and methods of the HtmlUnit interface to access Web pages.
8.
This entry studies a Web page segmentation method based on CURE clustering, together with rules for extracting the main content block. Node attributes are added to the page's DOM tree, converting it into an extended DOM tree that carries the offset of each information node. The CURE algorithm then clusters the information nodes, each resulting cluster representing one block of the page. Finally, three principal features of the content block are extracted and combined into a block-weight formula, which is used to identify the content block.
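The entry names a three-feature block-weight formula but does not quote it, so the features and coefficients below (text length, link-text ratio, punctuation count) are assumptions chosen only because they are typical of content-block scoring; the CURE clustering step itself is omitted.

```python
# Hypothetical block-weight formula for identifying the main content block:
# weight = text_length * (1 - link_density) + bonus * punctuation_count.
# All three features and the coefficient are illustrative assumptions.
def block_weight(text, link_text, punct_bonus=5.0):
    total = len(text) + len(link_text)
    if total == 0:
        return 0.0
    link_density = len(link_text) / total
    punct = sum(text.count(c) for c in ",.;:!?")
    return len(text) * (1.0 - link_density) + punct_bonus * punct

# each block: (plain text, text that sits inside links)
blocks = {
    "nav":     ("", "Home News Sports About Contact"),
    "article": ("The CURE algorithm clusters DOM nodes into page blocks; the "
                "block with the highest weight is taken as the main content.", ""),
    "footer":  ("Copyright 2008", "Privacy Terms"),
}
best = max(blocks, key=lambda k: block_weight(*blocks[k]))
```

Long, punctuation-rich, link-poor blocks score highest, so the article block is chosen over navigation and footer blocks.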
9.
Ontology-based Web information extraction. Total citations: 15 (self-citations: 0, citations by others: 15)
Taking ontology as its foundation and the hierarchical structure of the target information as the extraction path, this work defines an information-item ontology for Web pages and automatically parses each page to generate its structure ontology. By comparing the two ontologies, it constructs an inductive learning algorithm that semi-automatically generates extraction rules, achieving high efficiency in extracting information from Web pages.
10.
11.
Heidy M. Marin-Castro Victor J. Sosa-Sosa Jose F. Martinez-Trinidad Ivan Lopez-Arevalo 《Journal of Intelligent Information Systems》2013,40(1):85-108
The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and is generated dynamically by querying back-end (relational) databases through Web Query Interfaces (WQIs), a special type of HTML form. Accessing Deep Web information is a great challenge because it is usually not indexed by general-purpose search engines, so efficient mechanisms are needed to access, extract, and integrate it. Since WQIs are the only means of access to the Deep Web, their automatic identification plays an important role: it lets traditional search engines increase their coverage and reach interesting information unavailable on the indexable Web, and the accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal makes an adequate selection of HTML elements extracted from HTML forms, which feed a set of heuristic rules that help identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web query interfaces was analyzed to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing we use three groups of datasets: two available in the UIUC repository and a new dataset, created with a generic crawler supported by human experts, that includes both advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported work.
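The paper combines heuristic rules over selected form elements with a machine-learning classifier; the sketch below implements only the heuristic-rule half, on a few hand-picked signals (element counts and submit-button keywords) that are an illustrative simplification, not the paper's feature set.

```python
# Heuristic WQI (Web Query Interface) detection over HTML form elements.
# The rules (no password field, at least one text input or select, a submit
# label hinting at search) are assumptions standing in for the paper's
# learned classifier.
from html.parser import HTMLParser

class FormFeatures(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_inputs = 0
        self.password_inputs = 0
        self.selects = 0
        self.submit_labels = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            kind = a.get("type", "text")
            if kind == "text":
                self.text_inputs += 1
            elif kind == "password":
                self.password_inputs += 1
            elif kind == "submit":
                self.submit_labels.append((a.get("value") or "").lower())
        elif tag == "select":
            self.selects += 1

def is_wqi(form_html):
    f = FormFeatures()
    f.feed(form_html)
    if f.password_inputs:                  # login forms are not WQIs
        return False
    if not (f.text_inputs or f.selects):   # nothing to query with
        return False
    return any("search" in s or "find" in s for s in f.submit_labels)

search_form = ('<form><input type="text" name="q"/>'
               '<select name="category"></select>'
               '<input type="submit" value="Search"/></form>')
login_form = ('<form><input type="text" name="user"/>'
              '<input type="password" name="pw"/>'
              '<input type="submit" value="Log in"/></form>')
```

In the paper these signals would become feature vectors for the classifier rather than hard rules; the rule form just makes the element-selection idea concrete.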
12.
《Journal of Web Semantics》2008,6(4):266-273
Revyu is a live, publicly accessible reviewing and rating Web site, designed to be usable by humans whilst transparently generating machine-readable RDF metadata for the Semantic Web from user input. The site uses Semantic Web specifications such as RDF and SPARQL, and the latest Linked Data best practices, to create a major node in a potentially Web-wide ecosystem of reviews and related data. Throughout the implementation of Revyu, design decisions have aimed to minimize the burden on users by maximizing the reuse of external data sources and allowing less structured human input (in the form of Web 2.0-style tagging) from which stronger semantics can later be derived. Links to external sources such as DBpedia are exploited to create human-oriented mashups at the HTML level, whilst links are also made in RDF to ensure that Revyu plays a first-class role in the blossoming Web of Data. In this paper we document the design decisions made during the implementation of Revyu, discuss the techniques used to link Revyu data with external sources, and outline how data from the site is being used to infer the trustworthiness of reviewers as sources of information and recommendations.
13.
14.
《Computer Networks and ISDN Systems》1994,25(3):353-360
This paper describes the tools being evaluated at the University of Leeds for use by information providers on the World Wide Web. The paper also gives an introduction to the World Wide Web's client/server architecture and the Hypertext Markup Language (HTML), and points to further sources of information that will assist information providers and their trainers. The paper is intended for new information providers on the World Wide Web and for people involved in their training.
15.
16.
17.
《Computer Networks》1999,31(11-16):1495-1507
The Web mostly contains semi-structured information, yet it is not easy to search for and extract the structured data hidden in a Web page. Current practice addresses the problem by (1) syntax analysis (i.e. HTML tags) or (2) wrappers and user-defined declarative languages. The former suits only highly structured Web sites; the latter is time-consuming and scales poorly: wrappers can handle tens, but certainly not thousands, of information sources. In this paper we present a novel information mining algorithm, KPS, for semi-structured information on the Web. KPS employs keywords, patterns and/or samples to mine the desired information. Experimental results show that KPS is more efficient than existing Web extraction methods.
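The abstract does not specify KPS's pattern language, so the sketch below is only an assumption about the flavor of keyword-plus-pattern mining: a keyword anchors the match and a value pattern constrains what is extracted from the adjacent cell.

```python
# Hypothetical keyword+pattern mining over a semi-structured page fragment:
# the keyword ("Price") locates the record and the value pattern (a regex)
# validates and captures the target field. This is an illustration of the
# idea, not the KPS algorithm itself.
import re

page = """
<table>
<tr><td>Title</td><td>Web Data Mining</td></tr>
<tr><td>Price</td><td>$39.50</td></tr>
<tr><td>Pages</td><td>410</td></tr>
</table>
"""

def mine(keyword, value_pattern, text):
    """Return the cell that follows the keyword cell and matches the pattern."""
    pat = re.compile(
        r"<td>\s*%s\s*</td>\s*<td>\s*(%s)\s*</td>"
        % (re.escape(keyword), value_pattern))
    m = pat.search(text)
    return m.group(1) if m else None

price = mine("Price", r"\$[\d.]+", page)
```

Supplying a sample value instead of a pattern (the "S" in KPS) would amount to generalizing the sample into such a pattern automatically.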
18.
19.
A multiway-tree-based method for converting HTML to XML. Total citations: 4 (self-citations: 0, citations by others: 4)
Most Web information today is in HTML format, and because HTML files lack strict structure it is hard to find an effective way to retrieve or extract the data hidden in them. To address this shortcoming of HTML, this paper proposes a multiway-tree-based method for converting HTML to XML, turning the problem of retrieving information from HTML into one of retrieving it from XML and thereby simplifying the retrieval step that follows.
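The core idea of the entry can be sketched as follows: load loosely structured HTML into a multiway (n-ary) tree, then serialize that tree back out as well-formed XML so XML tooling can query it. Handling every HTML irregularity (implied end tags, all void elements) is beyond this sketch, which is an assumption-laden simplification rather than the paper's algorithm.

```python
# HTML -> multiway tree -> well-formed XML. Void elements get no children;
# a stray or missing end tag closes everything left open inside the nearest
# matching ancestor, so the output always parses as XML.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "img", "hr", "meta", "input", "link"}

class HtmlToXml(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag,
                           {k: (v or "") for k, v in attrs})
        if tag not in VOID:            # void elements never get children
            self.stack.append(el)

    def handle_endtag(self, tag):
        # pop to the matching open tag, closing anything still open inside it
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            el = self.stack[-1]
            if len(el):                # text after a child goes in its tail
                el[-1].tail = (el[-1].tail or "") + data
            else:
                el.text = (el.text or "") + data

def html_to_xml(html):
    p = HtmlToXml()
    p.feed(html)
    return ET.tostring(p.root, encoding="unicode")

messy = "<ul><li>one<li>two<br><li>three</ul>"
xml_out = html_to_xml(messy)
```

The unclosed `<li>` and bare `<br>` in the input still yield output that `ET.fromstring` accepts, which is exactly the property that lets the retrieval problem move from HTML to XML.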
20.