
Classifying and Extracting Web Pages with Clustering Techniques (应用聚类技术分类提取Web页面)

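Cite this article:	CUI Hui-chao, LIU Li. 应用聚类技术分类提取Web页面 [J]. 数字社区&智能家居 (Digital Community & Smart Home), 2010(1).

Authors:	CUI Hui-chao (崔慧超), LIU Li (刘莉)

Affiliation:	College of Information Science and Technology, Southwest Jiaotong University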

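Abstract:	Data-intensive dynamic pages on the Web contain little text and are highly structured. To address these characteristics, this paper presents a web information extraction method based on HTML structure. The method first parses the de-noised web pages, then computes the similarity between pages from their tree edit distance and clusters the pages, and finally generates extraction rules for each cluster and uses them to extract data from the pages.
Keywords:	Web information extraction; tree edit distance; clustering; extraction rules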
Application of Clustering Technology Category Extraction Web Pages

CUI Hui-chao, LIU Li. Application of Clustering Technology Category Extraction Web Pages [J]. Digital Community & Smart Home, 2010(1).

Authors:	CUI Hui-chao, LIU Li

Affiliation:	College of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China

Abstract:	Data-intensive dynamic web pages contain little text and are highly structured. For such pages, this paper outlines a web information extraction method based on HTML structure. The method first parses the de-noised web pages into DOM trees, then uses tree edit distance to compute the similarity between pages and clusters the pages accordingly. Finally, it generates extraction rules for each cluster and applies them to extract the data.

Keywords:	Web information extraction; tree edit distance; clustering; extraction rules
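
The parse, compare, and cluster steps named in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which tree edit distance or clustering algorithm is used, so the sketch substitutes a Selkow-style top-down edit distance (whole subtrees are inserted or deleted) and a single-pass threshold clustering. The names `Node`, `parse`, `tree_distance`, `similarity`, and `cluster_pages`, the normalization by total tree size, and the threshold parameter are all hypothetical. De-noising and extraction-rule generation are not shown because the abstract gives no detail on them.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never stay open on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; text content is ignored on purpose."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag-only tree under a synthetic '#root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_distance(a, b):
    """Selkow-style top-down distance (assumed metric, see lead-in).

    Relabeling a node costs 1; inserting or deleting a child removes or
    adds its whole subtree and costs the subtree size.
    """
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(d[i - 1][j] + size(ca),            # delete ca
                          d[i][j - 1] + size(cb),            # insert cb
                          d[i - 1][j - 1] + tree_distance(ca, cb))
    return int(a.tag != b.tag) + d[m][n]


def similarity(a, b):
    """Distance normalized by total size: 1.0 means identical structure."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


def cluster_pages(pages, threshold):
    """Single-pass clustering: a page joins the first cluster whose first
    member (the leader) is at least `threshold` similar, else starts a new one.
    Returns clusters as lists of indices into `pages`."""
    trees = [parse(p) for p in pages]
    clusters = []
    for i, tree in enumerate(trees):
        for members in clusters:
            if similarity(trees[members[0]], tree) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster_pages(pages, threshold)` on a list of HTML strings returns groups of page indices. Pages built from the same template differ mostly in repeated rows, so their distance stays small relative to their size, and an extraction rule could then be derived once per group.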
This article is indexed in CNKI and other databases.
