Extracting formatted batch data from web by JTidy and XML

Cite this article:	LIU Zhao-xia, HE Ming-xin. Extracting formatted batch data from web by JTidy and XML[J]. Computer Engineering and Design, 2010, 31(6)

Authors:	LIU Zhao-xia (刘钊夏), HE Ming-xin (何明昕)

Affiliation:	Department of Computer Science, Jinan University, Guangzhou, Guangdong 510632, China

Funding:	Natural Science Foundation of Guangdong Province

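Abstract:	To extract data effectively from the Web and to support the cleaning and integration of Web data, this paper proposes a method that uses XML and JTidy to extract data automatically and in batches from Web pages that publish formatted batch data. Based on the characteristics of such pages and on the idea of developing a general-purpose program, the tag structure of the pages is analyzed and classified. The main difficulties of the extraction process, namely identifying data elements and grouping them, are discussed, and on this basis an overall scanning and extraction algorithm is established. Experimental results demonstrate the feasibility and effectiveness of the batch extraction method.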
Keywords:	Web content extraction; JTidy toolkit; Dom4j toolkit; tag path; frequent path
Extracting formatted batch data from web by JTidy and XML

LIU Zhao-xia, HE Ming-xin. Extracting formatted batch data from web by JTidy and XML[J]. Computer Engineering and Design, 2010, 31(6)

Authors:	LIU Zhao-xia, HE Ming-xin

Affiliation:	LIU Zhao-xia, HE Ming-xin (Department of Computer Science, Jinan University, Guangzhou 510632, China)

Abstract:	To extract data from the Web effectively and to support Web data cleaning and integration, an approach is presented that uses XML and JTidy tools to automatically extract batch data of interest from Web pages. Targeting this specific type of Web page, and guided by the idea of a general-purpose program, the page structure is analyzed and classified. The main design difficulties, which are identifying and labeling data elements, are discussed, and algorithms for overall scanning and extraction are constructed. Finally, a case study of ext...

Keywords:	web content extraction; XML; JTidy; Dom4j; label path; frequent path
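
The abstract and keywords name tag (label) paths and frequent paths as the basis for identifying data elements. The sketch below illustrates that general idea only and is not the paper's algorithm: it counts how often each root-to-text tag path occurs and reports the most frequent one, which in a page of repeated records tends to be the path of the data cells. It assumes the page has already been cleaned into well-formed XML (the role JTidy plays in the paper) and uses the JDK's built-in DOM parser rather than Dom4j so that it runs without extra libraries. The class name `TagPathCounter` and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: not taken from the paper.
public class TagPathCounter {

    /** Counts how often each root-to-text tag path occurs in well-formed XML. */
    public static Map<String, Integer> countTextPaths(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Integer> counts = new LinkedHashMap<>();
        walk(doc.getDocumentElement(), "", counts);
        return counts;
    }

    // Depth-first walk; a non-blank text node adds one to its parent's path.
    private static void walk(Node node, String parentPath, Map<String, Integer> counts) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (!node.getNodeValue().trim().isEmpty()) {
                counts.merge(parentPath, 1, Integer::sum);
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip comments, processing instructions, etc.
        }
        String path = parentPath + "/" + node.getNodeName();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), path, counts);
        }
    }

    /** Returns the path with the highest count, or null if no text was found. */
    public static String mostFrequentPath(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input: one heading plus a table of repeated records.
        String xml = "<html><body><h1>Prices</h1><table>"
                + "<tr><td>A</td><td>1</td></tr>"
                + "<tr><td>B</td><td>2</td></tr>"
                + "</table></body></html>";
        Map<String, Integer> counts = countTextPaths(xml);
        System.out.println(counts);
        System.out.println(mostFrequentPath(counts));
    }
}
```

A real extractor would also need to group cells that share a frequent path into records, which the abstract identifies as one of the difficult steps; this sketch leaves that out.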
This article is indexed in CNKI, Wanfang Data, and other databases.
