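An Efficient Web Information Extraction Algorithm Based on Standard XML
Cite this article: WANG Ben. An Efficient Web Information Extraction Algorithm Based on Standard XML[J]. Journal of Hubei University of Technology, 2010, 25(2): 63-67.
Author: WANG Ben
Affiliation: School of Computer Science, Hubei University of Technology, Wuhan, Hubei 430068
Abstract: This paper discusses an XML-based method for extracting information from the Web. Ideally, data extraction would only require analyzing a website database composed of HTML pages. In practice, however, a comprehensive extraction process faces many obstacles. Correct data extraction also requires reliable data validation and error recovery services to cope with unavoidable extraction failures. The paper proposes a software framework named NIES, which can greatly improve the efficiency and accuracy of Web information extraction and ensures the quality of the extracted data. The key component of NIES performs data extraction with XML technology, including XHTML and XSLT, and supports access to the "deep Web".

Keywords: NIES crawler  deep Web  Web data extraction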

On High-efficiency Web-information Extraction Algorithm Based on XML
WANG Ben. On High-efficiency Web-information Extraction Algorithm Based on XML[J]. Journal of Hubei University of Technology, 2010, 25(2): 63-67.
Authors: WANG Ben
Affiliation: WANG Ben (School of Computer Science, Hubei Univ. of Technology, Wuhan 430068, China)
Abstract: A methodology for extracting Web information based on XML is described. Ideally, Web data extraction would only require analyzing a database that consists of HTML pages. However, the complete extraction process faces many obstacles. Correct data extraction also needs reliable data validation and error recovery to handle the failures that inevitably occur during Web data extraction. In this paper, a software framework named NIES is presented, which can improve the efficiency and accuracy of Web information extraction and ensure the quality of the extracted data. The key part of NIES extracts information using XML technology, including XHTML and XSLT, and supports connecting to the "deep Web".
Keywords: NIES crawler  deep Web  Web data extraction
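
As a rough illustration of the XHTML-based extraction with validation and error recovery that the abstract describes, the following Python sketch parses a well-formed XHTML page using only the standard library. It is not the NIES implementation: the sample page, the `products` table layout, and the `extract_records` helper are hypothetical, and plain ElementTree path queries stand in here for the XSLT stage mentioned in the abstract.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so queries must be namespace-aware.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="products">
      <tr><td class="name">Widget</td><td class="price">9.50</td></tr>
      <tr><td class="name">Gadget</td><td class="price">n/a</td></tr>
    </table>
  </body>
</html>"""


def extract_records(xhtml_text):
    """Return (records, errors) extracted from a well-formed XHTML page."""
    try:
        root = ET.fromstring(xhtml_text)
    except ET.ParseError as exc:
        # Error recovery: report the failure instead of aborting the whole run.
        return [], [f"parse error: {exc}"]

    records, errors = [], []
    for row in root.findall(".//x:table[@id='products']/x:tr", XHTML_NS):
        name = row.findtext("x:td[@class='name']", default="",
                            namespaces=XHTML_NS).strip()
        price_text = row.findtext("x:td[@class='price']", default="",
                                  namespaces=XHTML_NS).strip()
        # Data validation: a record is kept only if its price is numeric.
        try:
            price = float(price_text)
        except ValueError:
            errors.append(f"invalid price for {name!r}: {price_text!r}")
            continue
        records.append({"name": name, "price": price})
    return records, errors


if __name__ == "__main__":
    records, errors = extract_records(SAMPLE_PAGE)
    print(records)
    print(errors)
```

Returning the errors alongside the valid records, rather than raising on the first bad row, lets a crawler log the failure and continue with the remaining rows or pages.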
This article is indexed in databases including CNKI, 维普 (VIP), and 万方数据 (Wanfang Data).