Found 20 similar documents (search time: 62 ms)
1.
This paper describes a fast approach for detecting changes in HTML web pages. It saves computation time by limiting the similarity computations between two versions of a page to nodes with the same HTML tag type, and by hashing the page to provide direct access to node information. The approach is efficient enough to run as a client application and to support server applications that monitor modifications to HTML pages over time, and that report and visualize changes and trends so that users can judge the significance and types of those changes. Changes between two versions of a page are detected by performing similarity computations after transforming the page into an XML-like structure in which each node corresponds to an open-close HTML tag pair. Performance and detection reliability results showed speed improvements over a previous approach.
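The idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NodeCollector`, the functions `index_nodes` and `detect_changes`, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. A dictionary keyed by tag name stands in for the hashing step, so each new node is compared only against old nodes of the same tag type.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never have a closing tag; they are ignored in this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NodeCollector(HTMLParser):
    """Collect the text directly inside each open-close tag pair, bucketed by tag name."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # open tags as [tag, text_parts]
        self.by_tag = defaultdict(list)    # hash table: tag name -> node texts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, []])

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        # Ignore stray closing tags that were never opened.
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return
        # Close every node up to and including the matching one.
        while self.stack:
            open_tag, parts = self.stack.pop()
            self.by_tag[open_tag].append(" ".join(parts))
            if open_tag == tag:
                break


def index_nodes(html):
    """Return a dict mapping each tag name to the texts of its nodes."""
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.by_tag


def detect_changes(old_html, new_html, threshold=0.8):
    """Label new nodes as 'modified' or 'inserted', comparing only same-tag nodes."""
    old_nodes = index_nodes(old_html)
    new_nodes = index_nodes(new_html)
    changes = []
    for tag, new_texts in new_nodes.items():
        old_texts = old_nodes.get(tag, [])   # direct access by tag type
        unchanged = set(old_texts)
        for text in new_texts:
            if text in unchanged:
                continue
            best = max(
                (SequenceMatcher(None, text, old).ratio() for old in old_texts),
                default=0.0,
            )
            label = "modified" if best >= threshold else "inserted"
            changes.append((tag, text, label))
    return changes
```

Restricting the inner loop to `old_nodes.get(tag, [])` is what reduces the number of similarity computations in this sketch: a changed `p` node is never compared against `h1` or `div` nodes.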
2.
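In Web data mining, most web pages contain noise such as hyperlinks pointing to other pages. To reduce the impact of this noise on mining results, pages need to be cleaned and their body text extracted. In addition, the markup of many real-world pages is not well structured. To address this, the paper proposes a body text extraction algorithm suited to loosely structured pages. The page is split into nodes by its HTML tags, one node containing body text is located, and the remaining body text is extracted by working backward and forward from that node. Experimental results show that the algorithm is widely applicable and achieves high accuracy.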
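The seed-and-expand procedure described in this abstract might look roughly like the sketch below. The abstract does not say how the seed node is chosen or when expansion stops, so the choices here are assumptions: the seed is the longest text node, and a neighbor counts as body text if it is at least `min_len` characters long and contains sentence punctuation. The names `split_nodes`, `looks_like_body`, and `extract_body` are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
# Drop script and style blocks entirely before splitting on tags.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)


def split_nodes(html):
    """Split a page into non-empty text nodes using HTML tags as separators."""
    html = SCRIPT_RE.sub("", html)
    return [seg.strip() for seg in TAG_RE.split(html) if seg.strip()]


def looks_like_body(segment, min_len=20):
    """Assumed criterion: long enough and contains sentence punctuation."""
    return len(segment) >= min_len and any(p in segment for p in ".,;!?。,;!?")


def extract_body(html, min_len=20):
    """Pick a seed node, then extend backward and forward over body-like nodes."""
    nodes = split_nodes(html)
    if not nodes:
        return ""
    # Assumed seed choice: the longest text node on the page.
    seed = max(range(len(nodes)), key=lambda i: len(nodes[i]))
    start = end = seed
    while start > 0 and looks_like_body(nodes[start - 1], min_len):
        start -= 1
    while end < len(nodes) - 1 and looks_like_body(nodes[end + 1], min_len):
        end += 1
    return "\n".join(nodes[start:end + 1])
```

Because the expansion only looks at neighboring nodes and never relies on a particular nesting of `div` or `table` elements, the sketch tolerates irregular markup, which is the property the abstract emphasizes.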
3.
Selma Ayşe Özel 《Expert systems with applications》2011,38(4):3407-3415
The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawling process, an automatic Web page classification mechanism is needed to determine whether the page being considered is on topic. In this study, a genetic algorithm (GA) based automatic Web page classification system is developed. It uses both HTML tags and the terms belonging to each tag as classification features, and learns an optimal classifier from the positive and negative Web pages in the training dataset. The system classifies Web pages by simply computing the similarity between the learned classifier and the new Web pages. Existing GA-based classifiers use only HTML tags or only terms as features; in this study both are used together, and optimal weights for the features are learned by the GA. Using both HTML tags and the terms in each tag as separate features was found to improve classification accuracy. The number of documents in the training dataset also affects accuracy: when the training dataset contains more negative documents than positive documents, the classification accuracy of the system rises to as much as 95% and exceeds that of the well-known Naïve Bayes and k-nearest neighbor classifiers.
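The classification step described in this abstract, computing a similarity between a learned classifier and a new page, can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: the GA training loop is omitted, the weights are supplied by hand, cosine similarity is assumed as the similarity measure, and the names `TagTermExtractor`, `page_features`, `similarity`, and `is_on_topic` as well as the 0.5 threshold are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TagTermExtractor(HTMLParser):
    """Count (tag, term) pairs so the same term in different tags is a separate feature."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_tags:
            return
        tag = self.open_tags[-1]
        for term in re.findall(r"[a-z]+", data.lower()):
            self.features[(tag, term)] += 1


def page_features(html):
    """Return a Counter of (tag, term) features for one page."""
    extractor = TagTermExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.features


def similarity(weights, features):
    """Cosine similarity between a weight vector and a page's feature counts."""
    dot = sum(w * features.get(f, 0) for f, w in weights.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_f = math.sqrt(sum(c * c for c in features.values()))
    return dot / (norm_w * norm_f) if norm_w and norm_f else 0.0


def is_on_topic(weights, html, threshold=0.5):
    """Classify a page as on topic when its similarity reaches the threshold."""
    return similarity(weights, page_features(html)) >= threshold
```

In a full system along the lines the abstract describes, a genetic algorithm would search for the `weights` dictionary that best separates the positive and negative training pages; here a hand-written dictionary such as `{("title", "football"): 0.9}` plays that role.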
4.
5.
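This paper proposes a new method for removing noise from web pages based on page frames and rules. The method divides a page into several parts according to its HTML tags, compares the aspect ratio attribute of each table, discards the parts with a very large aspect ratio, analyzes the content of the remaining tables, and, depending on whether they contain tags related to paragraph text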