Semi-structure page information extraction algorithm with automatic granularity selection


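Cite this article:	WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang. Semi-structure page information extraction algorithm with automatic granularity selection[J]. Computer Engineering and Applications, 2009, 45(6): 165-167.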

Authors:	WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang

Affiliation:	Teaching and Research Section 602, Department of Network Engineering, PLA Electronic Engineering Institute, Hefei 230037

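Abstract:	Data records in a semi-structured page are structurally similar. In the tag sequence generated by a pre-order traversal of the DOM tree, this similarity appears as repeated patterns, which can be mined with a suffix tree. Because the tag sequence can be represented at two levels, block granularity and text granularity, and the best extraction patterns obtained at different granularities vary unpredictably in extraction quality, this paper proposes a semi-structured page information extraction method with automatic granularity selection. From the repeated patterns obtained from the suffix tree, the algorithm selects maximal repeats and tandem repeats to form a candidate pattern set, uses feature parameters to determine the best pattern set for each of the two granularities, and finally introduces an extraction-result regularity parameter and performs a comprehensive evaluation to determine the extraction pattern, completing the automatic extraction of data records from semi-structured pages.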
Keywords:	information extraction; repeated pattern mining; granularity analysis; suffix tree
Received:	2008-1-14
Revised:	2008-4-15
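
The abstract describes a tag sequence produced by a pre-order traversal of the DOM tree, viewed at block granularity or text granularity. The following is a minimal sketch of that step, not the authors' implementation. It assumes that block granularity keeps only a small, illustrative set of block-level tags (`BLOCK_TAGS`) and that text granularity keeps every tag plus a `#text` marker for non-empty text nodes; the class and function names are hypothetical. Start tags arrive in document order, which matches a pre-order traversal of the DOM tree.

```python
from html.parser import HTMLParser

# Assumption: this block-level tag set is illustrative, not taken from the paper.
BLOCK_TAGS = {"table", "tr", "td", "ul", "ol", "li", "div", "p", "dl", "dt", "dd"}


class TagSequenceBuilder(HTMLParser):
    """Collects tags in document order, i.e. a pre-order walk of the DOM tree."""

    def __init__(self, granularity):
        super().__init__()
        self.granularity = granularity  # "block" or "text"
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # Text granularity keeps every tag; block granularity keeps block tags only.
        if self.granularity == "text" or tag in BLOCK_TAGS:
            self.sequence.append(tag)

    def handle_data(self, data):
        # Text nodes appear only at text granularity; whitespace-only nodes are skipped.
        if self.granularity == "text" and data.strip():
            self.sequence.append("#text")


def tag_sequence(html, granularity="block"):
    """Return the tag sequence of an HTML fragment at the requested granularity."""
    builder = TagSequenceBuilder(granularity)
    builder.feed(html)
    builder.close()
    return builder.sequence
```

For a list of two similar records, the block-level sequence is short and coarse, while the text-level sequence exposes the inner structure of each record, which is why the two granularities can yield different candidate patterns.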
Semi-structure page information extraction algorithm with automatic granularity selection

WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang. Semi-structure page information extraction algorithm with automatic granularity selection[J]. Computer Engineering and Applications, 2009, 45(6): 165-167.

Authors:	WANG Xiao-bin, WANG Peng-po, SHI Zhao-xiang

Affiliation:	Department of Network Engineering, Electronic Engineering Institute, Hefei 230037, China

Abstract:	Data records of a semi-structured Web page are similar in structure. This property appears as repeated tag strings in the tag sequence produced by a pre-order traversal of the DOM tree, which can generally be mined by constructing a suffix tree. Since the tag sequence can be generated at both the block tag level and the text tag level, and the performance of patterns at different granularities is uncertain, this paper introduces a semi-structured information extraction algorithm with automatic granularity selection. First, it generates candidate pattern collections at the two granularities by searching for maximal repeats and tandem repeats in the respective suffix trees, and then evaluates these patterns with statistical metrics. A new metric of extraction result regularity and a weighted approach are introduced for selecting the target pattern.

Keywords:	information extraction; repeat pattern mining; granularity analysis; suffix-tree
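
The abstract names tandem repeats as one source of candidate patterns. As an illustration only, the sketch below finds tandem repeats (a unit repeated back to back) in a tag sequence with a naive quadratic scan. It deliberately does not build a suffix tree, which is what the paper uses, and the function name and the `min_unit` and `min_copies` parameters are hypothetical.

```python
def tandem_repeats(seq, min_unit=2, min_copies=2):
    """Return (start, unit, copies) for runs where `unit` repeats back to back.

    Naive scan for illustration; the paper mines repeats from a suffix tree.
    """
    found = []
    n = len(seq)
    # Guard against a zero-length unit, which would never terminate.
    for unit_len in range(max(1, min_unit), n // min_copies + 1):
        start = 0
        while start + unit_len * min_copies <= n:
            unit = tuple(seq[start:start + unit_len])
            copies = 1
            pos = start + unit_len
            # Count how many adjacent copies of the unit follow.
            while tuple(seq[pos:pos + unit_len]) == unit:
                copies += 1
                pos += unit_len
            if copies >= min_copies:
                found.append((start, unit, copies))
                start = pos  # continue after the whole run
            else:
                start += 1
    return found
```

On the sequence `["ul", "li", "a", "#text", "li", "a", "#text", "li", "a", "#text"]`, the scan reports the unit `("li", "a", "#text")` starting at index 1 with three copies. Longer runs would also be reported under non-primitive units (for example, a doubled unit), so a real candidate set would need a filtering step on top of this.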
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.