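Semi-structured page information extraction with automatic granularity selection
Cite as: WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang. Semi-structured page information extraction with automatic granularity selection[J]. Computer Engineering and Applications, 2009, 45(6): 165-167.
Authors: WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang
Affiliation: Teaching and Research Section 602, Department of Network Engineering, PLA Electronic Engineering Institute, Hefei 230037, China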
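Abstract: The data records of a semi-structured page are structurally similar. In the tag sequence generated by a pre-order traversal of the DOM tree, this similarity appears as repeated patterns, which can be mined with a suffix tree. The tag sequence can be generated at two levels, block granularity and text granularity, and the best extraction patterns produced at the two granularities differ unpredictably in extraction quality. This paper therefore proposes a semi-structured page information extraction method with automatic granularity selection. From the repeated patterns obtained from the suffix tree, the algorithm selects maximal repeats and tandem repeats to form a candidate pattern set, and uses feature parameters to determine the best pattern set for each granularity. Finally, it introduces a regularity parameter for the extraction result and performs a combined evaluation to choose the extraction pattern, which completes the automatic extraction of data records from semi-structured pages.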

Keywords: information extraction; repeated pattern mining; granularity analysis; suffix tree
Received: 2008-1-14
Revised: 2008-4-15
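
The first step named in the abstract, turning a page into a tag sequence by pre-order traversal at two granularities, might look roughly like the sketch below. It relies on Python's standard `html.parser`, whose start-tag events arrive in document order, which matches a pre-order walk of the DOM tree. The abstract does not define the two granularities precisely, so `BLOCK_TAGS`, the `#text` token, and the names `TagSequencer` and `tag_sequence` are illustrative assumptions, not the authors' definitions.

```python
from html.parser import HTMLParser

# Assumption: a small hand-picked set of block-level tags.
# The paper's actual tag classification is not given in the abstract.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "li", "p", "dl", "dt", "dd"}


class TagSequencer(HTMLParser):
    """Collects start tags in document order (pre-order of the DOM tree)."""

    def __init__(self, block_only):
        super().__init__()
        self.block_only = block_only
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # Block granularity keeps only block-level tags; text granularity keeps all tags.
        if not self.block_only or tag in BLOCK_TAGS:
            self.sequence.append(tag)

    def handle_data(self, data):
        # Text granularity also records non-empty text nodes as a "#text" token.
        if not self.block_only and data.strip():
            self.sequence.append("#text")


def tag_sequence(html, block_only):
    """Return the tag sequence of `html` at the requested granularity."""
    parser = TagSequencer(block_only)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.sequence


PAGE = "<ul><li><a href='/1'>A</a><b>1</b></li><li><a href='/2'>B</a><b>2</b></li></ul>"
print(tag_sequence(PAGE, block_only=True))
print(tag_sequence(PAGE, block_only=False))
```

On this toy page the two `li` records are visible at both granularities, but the text-granularity sequence carries more detail per record, which is the kind of difference the granularity selection has to weigh.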

Semi-structure page information extraction algorithm with automatic granularity selection
WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang. Semi-structure page information extraction algorithm with automatic granularity selection[J]. Computer Engineering and Applications, 2009, 45(6): 165-167.
Authors: WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang
Affiliation: Department of Network Engineering, Electronic Engineering Institute, Hefei 230037, China
Abstract: Data records in a semi-structured Web page are structurally similar. This property shows up as repeated tag strings in the tag sequence produced by a pre-order traversal of the DOM tree, and such repeats can be mined by constructing a suffix tree. Because the tag sequence can be generated at both the block-tag level and the text-tag level, and the patterns found at the two granularities perform unpredictably relative to each other, this paper introduces a semi-structured information extraction algorithm with automatic granularity selection. It first builds a candidate pattern set for each granularity by searching for maximal repeats and tandem repeats in the corresponding suffix tree, then evaluates these patterns with statistical metrics. A new metric for the regularity of the extraction result and a weighted evaluation are introduced to select the target pattern.
Keywords: information extraction; repeat pattern mining; granularity analysis; suffix-tree
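
To make the notion of a tandem repeat in a tag sequence concrete, here is a minimal sketch. It deliberately replaces the suffix tree used in the paper with a naive scan, which is slower but easier to follow, and it covers only tandem repeats, not maximal repeats. The function name `tandem_repeats`, the `min_copies` parameter, and the sample sequence are assumptions made for illustration.

```python
def tandem_repeats(seq, min_copies=2):
    """Find runs where a unit of tokens repeats back to back.

    Returns a list of (start, unit, copies) tuples. This is a naive scan,
    not the suffix-tree method described in the abstract.
    """
    n = len(seq)
    found = []
    for length in range(1, n // min_copies + 1):
        start = 0
        while start + length * min_copies <= n:
            unit = seq[start:start + length]
            copies = 1
            # Count how many times the unit repeats consecutively from `start`.
            while seq[start + copies * length:start + (copies + 1) * length] == unit:
                copies += 1
            if copies >= min_copies:
                found.append((start, tuple(unit), copies))
                start += copies * length  # skip past the whole run
            else:
                start += 1
    return found


# Hypothetical text-granularity tag sequence for a two-item list.
SEQ = ["ul", "li", "a", "#text", "b", "#text", "li", "a", "#text", "b", "#text"]
print(tandem_repeats(SEQ))
```

Each reported unit is a candidate record pattern: its start offset and copy count tell where the repeated region begins and how many data records it would extract.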
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.