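Heuristic-Optimizing Based Webpage Elements Extraction
Cite this article：Song Jian-hao, Zhao Gang. Heuristic-Optimizing Based Webpage Elements Extraction[J]. Information Security and Technology, 2012, 3(6): 66-69
Authors：Song Jian-hao  Zhao Gang
Affiliation：School of Information Management, Beijing Information Science & Technology University, Beijing 100192
Funding：Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (PHR201106133); Beijing Information Science & Technology University Undergraduate Science and Technology Innovation Program
Abstract：In web information extraction, heuristic rules are an important means of identifying page tag information, exploiting the results of page-node analysis, and extracting information according to the different kinds of content on a page. Based on an analysis of existing heuristic rules, this study proposes several optimized heuristic rules that accurately extract element information such as a page's title, publication time, source, and body text. It further proposes using the edit distance algorithm to judge the accuracy of body-text extraction, together with a threshold optimization method that overcomes the excess noise nodes and incomplete content identification found in body extraction, greatly improving extraction accuracy.

Keywords：heuristic rules  optimization  webpage elements  accurate extraction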

Heuristic-Optimizing Based Webpage Elements Extraction
Song Jian-hao Zhao Gang. Heuristic-Optimizing Based Webpage Elements Extraction[J]. Information Security and Technology, 2012, 3(6): 66-69
Authors:Song Jian-hao Zhao Gang
Affiliation：School of Information Management, Beijing Information Science & Technology University, Beijing 100192
Abstract：Heuristic rules are an important means of web information extraction: they identify tag information, make use of page-node analysis results, and extract information according to the different kinds of content on a page. Building on an analysis of existing heuristic rules, this study presents several optimized heuristic rules that enable accurate extraction of a page's title, release time, source, and body. The study further proposes using the Levenshtein distance to evaluate the accuracy of body extraction, and a threshold optimization method that addresses two shortcomings of body extraction, namely the large number of noise nodes and incomplete identification of the body content, thereby greatly improving the accuracy of the extracted information.
Keywords:heuristic rules  optimize  webpage elements  accurate extraction
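
The abstract mentions using the Levenshtein (edit) distance to judge how accurately the body text was extracted, but it does not give the exact formula. The sketch below is one plausible reading, not the authors' implementation: it computes the standard edit distance between an extracted body and a hand-labelled reference body, then normalizes it by the longer length to get a score in [0, 1]. The function names `levenshtein` and `extraction_accuracy`, and the normalization itself, are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: insertions, deletions, substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def extraction_accuracy(extracted: str, reference: str) -> float:
    """Hypothetical accuracy score: 1 minus the length-normalized edit distance.

    This normalization is an assumption; the paper's metric may differ.
    """
    longest = max(len(extracted), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(extracted, reference) / longest
```

Under this reading, a score of 1.0 means the extracted body matches the reference exactly, and leftover noise nodes or missing paragraphs lower the score in proportion to the number of characters that would need to be edited.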
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.