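Reverse-Order Parsing of the DOM Tree and Extraction of Web Page Main-Text Information
Cite this article: ZHANG Rui-xue, SONG Ming-qiu, GONG Yan-lei. Reverse-Order Parsing of the DOM Tree and Extraction of Web Page Main-Text Information [J]. Computer Science (计算机科学), 2011, 38(4): 213-215, 225.
Authors: ZHANG Rui-xue  SONG Ming-qiu  GONG Yan-lei
Affiliation: Institute of System Engineering, Dalian University of Technology, Dalian 116023
Funding: This work was supported by the National Natural Science Foundation of China (70671016).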
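Abstract: In general, extracting the main text of an HTML Web page means first parsing the HTML page into a DOM tree, then traversing the tree and extracting the target information according to how it is distributed in the tree. This traditional approach treats parsing the DOM tree and extracting information from it as two independent processes, which limits extraction speed. In fact, accurate extraction of the target information does not require the whole DOM tree to be parsed separately. This paper proposes a reverse-order DOM tree parsing algorithm. Combined with DOM tree similarity theory and the traditional forward-order parsing algorithm, it starts from a part of the target information and parses the DOM tree forward (toward the end of the page) and in reverse (toward the beginning), locating and retrieving the remaining target information at the same time. When this method is used to extract the main text of a Web page, only part of the DOM tree has to be parsed, which reduces the time spent building the tree structure, and the whole tree no longer has to be traversed to find the target information, which saves search time. Together these greatly increase extraction speed. Finally, experiments confirm the advantage of the method.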

Keywords: DOM tree, Web page main-text extraction, structural similarity, reverse-order parsing

Parsing DOM Tree Reversely and Extracting Web Main Page Information
ZHANG Rui-xue,SONG Ming-qiu,GONG Yan-lei.Parsing DOM Tree Reversely and Extracting Web Main Page Information[J].Computer Science,2011,38(4):213-215,225.
Authors:ZHANG Rui-xue  SONG Ming-qiu  GONG Yan-lei
Affiliation:(Institute of System Engineering,Dalian University of Technology,Dalian 116023,China)
Abstract:To extract the main content of an HTML Web page, one generally parses the HTML into a DOM tree, traverses the whole tree, and extracts the data according to how the target information is distributed in the tree. This approach treats parsing and extraction as two separate processes and therefore limits extraction speed. In fact, parsing the whole DOM tree is unnecessary. We propose an algorithm that parses the DOM tree in reverse order. Combining it with the theory of DOM similarity and the traditional forward-order parsing method, we start from a part of the target information and parse the DOM tree in both forward and reverse order, locating and retrieving the other targets at the same time. On the one hand, this method parses only part of the DOM tree, which reduces parsing time. On the other hand, it does not have to traverse the whole tree to find the target information, which saves search time. Overall, the method substantially improves extraction speed. Finally, experiments confirm the superiority of the method.
Keywords:DOM tree  Web content extracting  Structural similarity  Parsing reversely
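
The two-direction idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract gives no pseudocode, and the DOM-similarity step is omitted entirely. The function names `tags_backward` and `extract_around`, the `container` parameter, and the regex-based tag tokenizer are all assumptions made for illustration. Starting from a known fragment of the target text (the "seed"), the sketch scans tags in reverse order until it reaches the opening tag of the enclosing container, then scans forward until it reaches the matching closing tag, so tags outside that region are never examined.

```python
import re

# Simplified tag tokenizer (assumption): ignores comments, <script> bodies,
# and attribute values that contain ">".
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tags_backward(html, pos):
    """Yield tag matches located before `pos`, nearest first."""
    end = pos
    while True:
        lt = html.rfind("<", 0, end)
        if lt < 0:
            return
        m = TAG.match(html, lt)
        if m and m.end() <= pos:
            yield m
        end = lt


def extract_around(html, seed, container="div"):
    """Return the text of the nearest `container` element enclosing `seed`,
    or None if the seed or a balanced container cannot be found."""
    pos = html.find(seed)
    if pos < 0:
        return None

    # Reverse pass: a closing tag seen while walking backward means a sibling
    # block must be skipped; an opening tag at depth 0 is the enclosing one.
    start, depth = None, 0
    for m in tags_backward(html, pos):
        if m.group(2).lower() != container:
            continue
        if m.group(1):            # closing tag of a sibling container
            depth += 1
        elif depth == 0:          # opening tag of the enclosing container
            start = m.end()
            break
        else:
            depth -= 1
    if start is None:
        return None

    # Forward pass: stop at the first unbalanced closing tag after the seed.
    stop, depth = None, 0
    for m in TAG.finditer(html, pos):
        if m.group(2).lower() != container:
            continue
        if not m.group(1):        # nested container opens
            depth += 1
        elif depth == 0:          # matching close of the enclosing container
            stop = m.start()
            break
        else:
            depth -= 1
    if stop is None:
        return None

    # Strip the remaining tags and normalize whitespace.
    inner = re.sub(r"<[^>]+>", " ", html[start:stop])
    return " ".join(inner.split())


if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a>Home</a></div>"
            "<div id='main'><p>First paragraph.</p>"
            "<div class='q'>quote</div><p>Second paragraph.</p></div>"
            "<div id='foot'>Copyright</div></body></html>")
    print(extract_around(page, "Second paragraph"))
```

In this sketch the navigation and footer blocks are never tokenized, which mirrors the abstract's point that only part of the tree needs to be parsed. A real implementation would need a proper HTML tokenizer and some way to obtain the seed text, neither of which the abstract specifies.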
This article is indexed in Wanfang Data (万方数据) and other databases.