
Web Page Segmentation and Main-Text Block Extraction Based on the CURE Algorithm (基于CURE算法的网页分块及正文块提取研究)

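Cite this article:	Wang Chao, Xu Jiefeng. Web Page Segmentation and Main-Text Block Extraction Based on the CURE Algorithm [J]. Microcomputer & its Applications (微型机与应用), 2012, 31(12): 11-14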

Authors:	Wang Chao, Xu Jiefeng

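Affiliation:	Department of Computer Science and Technology, College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, Shandong 266000, China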

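Abstract:	This paper studies a Web page segmentation method based on CURE clustering, together with rules for extracting the main-text block. Node attributes are added to the page's DOM tree, converting it into an extended DOM tree that carries information-node offsets. The CURE algorithm is then used to cluster the information nodes, and each resulting cluster represents a distinct block of the page. Finally, three main features of the main-text block are extracted and used to construct an information-block weight formula, which is used to identify the main-text block.
Keywords:	Web information extraction; clustering algorithm; page segmentation; main-text block extraction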
An approach to Web page segmentation and information extraction based on the CURE algorithm

Wang Chao, Xu Jiefeng. An approach to Web page segmentation and information extraction based on the CURE algorithm [J]. Microcomputer & its Applications, 2012, 31(12): 11-14

Authors:	Wang Chao, Xu Jiefeng

Affiliation:	(Computer Science and Technology Department, College of Computer and Communication Engineering, China University of Petroleum, Qingdao 266000, China)

Abstract:	This paper discusses an approach to Web page segmentation based on the CURE clustering algorithm, together with rules for extracting the main text block. The main idea is to add attributes to the nodes of a standardized DOM tree, converting it into an extended DOM tree that carries information-node offsets. The CURE algorithm is then used to cluster the information nodes, and each resulting cluster represents a different block of the page. Finally, we extract three main features of the text block and construct an information-block weight formula that identifies the text block.

Keywords:	Web information extraction; clustering algorithm; page segmentation; text block extraction
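
The pipeline outlined in the abstract (information nodes with offsets, CURE clustering into blocks, a weight formula that picks the text block) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `InfoNode` fields, the use of a single one-dimensional offset, the parameters `c` and `alpha`, and above all the three features in `block_weight` (text length, link density, punctuation ratio), since the abstract does not name the features or give the formula. The clustering is a simplified CURE without sampling or partitioning.

```python
import math
from dataclasses import dataclass

@dataclass
class InfoNode:
    offset: float        # hypothetical 1-D offset of the node in the extended DOM tree
    text: str            # visible text of the node
    link_chars: int = 0  # characters that sit inside links (assumed feature)

def _scattered(points, c):
    """Pick up to c well-scattered points using farthest-first selection."""
    centroid = sum(points) / len(points)
    chosen = [max(points, key=lambda p: abs(p - centroid))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(points, key=lambda p: min(abs(p - q) for q in chosen)))
    return chosen

def _representatives(points, c, alpha):
    """Shrink the scattered points toward the centroid by factor alpha (CURE's key step)."""
    centroid = sum(points) / len(points)
    return [p + alpha * (centroid - p) for p in _scattered(points, c)]

def cure_cluster(nodes, k, c=3, alpha=0.3):
    """Simplified CURE: merge the two clusters with the closest representatives until k remain."""
    clusters = [[n] for n in nodes]
    while len(clusters) > k:
        reps = [_representatives([n.offset for n in cl], c, alpha) for cl in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in reps[i] for b in reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def block_weight(block, w_len=1.0, w_link=1.0, w_punct=1.0):
    """Hypothetical weight: long, punctuation-rich, link-poor blocks score higher."""
    text = "".join(n.text for n in block)
    total = max(len(text), 1)
    link_density = sum(n.link_chars for n in block) / total
    punct_ratio = sum(text.count(ch) for ch in ",.;!?，。；！？") / total
    return w_len * math.log1p(len(text)) - w_link * link_density + w_punct * punct_ratio

# Toy page: a navigation bar, an article body, and a footer.
nodes = [
    InfoNode(1, "Home", 4), InfoNode(2, "News", 4), InfoNode(3, "About", 5),
    InfoNode(50, "CURE clusters the information nodes of a page. "),
    InfoNode(51, "Each resulting cluster is treated as one block. "),
    InfoNode(52, "The block with the highest weight is the text block."),
    InfoNode(100, "Copyright 2012"), InfoNode(101, "Contact", 7),
]
blocks = cure_cluster(nodes, k=3)
main = max(blocks, key=block_weight)
print(sorted(n.offset for n in main))
```

Shrinking the representative points toward the centroid is what distinguishes CURE from plain single-link clustering: it dampens the effect of outlying nodes while still allowing non-spherical clusters. The weights `w_len`, `w_link`, and `w_punct` are placeholders that would need tuning on real pages.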
This article is indexed in CNKI, Wanfang Data, and other databases.
