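An SRI-Based Information Extraction Method for Dynamic Web Pages
Cite as: ZHU Yue-Lin, DAI Chang-Lin, GAO Zhi-Qiang. An SRI-Based Information Extraction Method for Dynamic Web Pages[J]. Journal of Chongqing Institute of Technology, 2009, 23(10): 87-93.
Authors: ZHU Yue-Lin, DAI Chang-Lin, GAO Zhi-Qiang
Affiliations: [1] CAPDI (WuXi) Hengxin Engineering Consultants Co. Ltd., Wuxi 214001, Jiangsu, China; [2] School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Funding: National Natural Science Foundation of China (60873153, 60803061)
Abstract: This paper proposes an information extraction method for dynamic web pages based on similar records induction (SRI). The method uses an edit distance algorithm and a tree alignment algorithm to induce a wrapper tree for record items. In extraction experiments on various types of web pages, it achieved a recall of 98.11% and a precision of 96.90%.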

Keywords: dynamic web page; information extraction; wrapper; DOM tree

Information Extraction Method for Dynamic Web Pages Based on Similar Records Induction
Cite as: ZHU Yue-Lin, DAI Chang-Lin, GAO Zhi-Qiang. Information Extraction Method for Dynamic Web Pages Based on Similar Records Induction[J]. Journal of Chongqing Institute of Technology, 2009, 23(10): 87-93.
Authors: ZHU Yue-Lin (1), DAI Chang-Lin (2), GAO Zhi-Qiang (2)
Affiliation: (1) CAPDI (WuXi) Hengxin Engineering Consultants Co. Ltd., Wuxi 214001, China; (2) School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Abstract: Dynamic web pages are pages generated automatically by programs. It is estimated that most web pages exist in the form of dynamic web pages. This paper puts forward an extraction method based on similar records induction (SRI), which uses a string edit distance algorithm and a DOM tree alignment algorithm to generate a record wrapper. Experimental results show that the method achieves a recall of 98.11% and a precision of 96.90% across various kinds of dynamic web pages.
Keywords:dynamic Web page  information extraction  wrapper  DOM tree  
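
The abstract names string edit distance as one of the two building blocks of SRI but gives no detail on how it is applied. The sketch below is an illustration only, not the authors' implementation: it assumes each candidate record is represented as a sequence of DOM tag names and that two records count as "similar" when a normalized edit-distance similarity exceeds a threshold. The function names, the sample tag sequences, and the 0.7 threshold are all hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: insert, delete, and substitute each cost 1."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical tag sequences for three DOM subtrees (assumed data).
record_a = ["tr", "td", "a", "td", "span"]
record_b = ["tr", "td", "a", "td", "b"]
record_c = ["div", "p"]

THRESHOLD = 0.7  # assumed value, not taken from the paper

if __name__ == "__main__":
    print(similarity(record_a, record_b) >= THRESHOLD)
    print(similarity(record_a, record_c) >= THRESHOLD)
```

Under these assumptions, `record_a` and `record_b` differ in a single tag and would be grouped as similar records, while `record_c` would not. The DOM tree alignment step that the abstract also mentions is not sketched here, since the abstract does not describe how it operates.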
This article is indexed in CNKI, VIP, and other databases.