Information extraction for web forum based on suffix tree

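Cite this article:	XIAO Jian-peng, ZHANG Lai-shun, REN Xing, SONG Xiao-guang. Information extraction for web forum based on suffix tree[J]. Computer Engineering and Design, 2008, 29(7): 1675-1677.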

Authors:	XIAO Jian-peng, ZHANG Lai-shun, REN Xing, SONG Xiao-guang

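Affiliations:	1. Institute of Electronic Technology, PLA Information Engineering University, Zhengzhou, Henan 450004, China; 2. China PLA Troop 65012, Shenyang, Liaoning 110101, China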

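Abstract:	To address the shortcomings of existing information extraction for online forums, a forum information extraction method based on suffix trees is proposed. The standardized HTML document is converted into a suffix tree, the repeated patterns in it are located and used to generate a wrapper, and the wrapper is converted into an NFA (nondeterministic finite automaton) to extract the forum information. By applying suffix tree construction to forum information extraction, the method largely resolves the poor accuracy and limited generality of existing extraction methods. Experimental results show that the method is accurate and practical.
Keywords:	information extraction; wrapper; suffix tree; repeated pattern
Article ID:	1000-7024(2008)07-1675-03
Revision date:	2 May 2007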
Information extraction for web forum based on suffix tree

XIAO Jian-peng, ZHANG Lai-shun, REN Xing, SONG Xiao-guang. Information extraction for web forum based on suffix tree[J]. Computer Engineering and Design, 2008, 29(7): 1675-1677.

Authors:	XIAO Jian-peng, ZHANG Lai-shun, REN Xing, SONG Xiao-guang

Affiliation:	XIAO Jian-peng (1), ZHANG Lai-shun (1), REN Xing (1), SONG Xiao-guang (2). (1) Institute of Electronic Technology, PLA Information Engineering University, Zhengzhou 450004, China; (2) China PLA Troop 65012, Shenyang 110101, China

Abstract:	To address the limitations of current methods for extracting web forum information, an information extraction method for web forums based on suffix trees is proposed. First, the standardized HTML files are converted into suffix trees; then the repeated patterns are found and used to build a wrapper; finally, the wrapper is converted into an NFA to extract the web forum information. The method uses suffix tree technology to extract the web forum information. The method has more accurate and applicab...

Keywords:	information extraction; wrapper; suffix tree; repeated pattern; forum
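
The abstract outlines a pipeline: standardize the HTML, build a suffix tree, find repeated patterns, and derive a wrapper. The full paper is not reproduced here, so the following is only a minimal sketch of the repeated-pattern step under stated assumptions, not the authors' implementation. It uses a naive, depth-limited suffix trie over tag tokens instead of a compact suffix tree, and it omits the wrapper-to-NFA conversion entirely. All names (`tokenize`, `build_suffix_trie`, `repeated_patterns`, `best_record_pattern`) and the selection rule (longest pattern, then most non-overlapping occurrences) are hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass, field

# Matches opening and closing tags; attributes are ignored (an assumption
# standing in for the paper's HTML standardization step).
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tokenize(html):
    """Reduce HTML to tag tokens, collapsing each text run to 'TEXT'."""
    tokens, pos = [], 0
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():
            tokens.append("TEXT")
        tokens.append("<" + m.group(1) + m.group(2).lower() + ">")
        pos = m.end()
    if html[pos:].strip():
        tokens.append("TEXT")
    return tokens


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    starts: list = field(default_factory=list)  # suffix start positions reaching this node


def build_suffix_trie(tokens, max_depth):
    """Insert every suffix (truncated to max_depth) into an uncompressed trie."""
    root = Node()
    for i in range(len(tokens)):
        node = root
        for tok in tokens[i:i + max_depth]:
            node = node.children.setdefault(tok, Node())
            node.starts.append(i)
    return root


def repeated_patterns(tokens, min_count=2, max_depth=50):
    """Yield (pattern, starts) for token sequences occurring at least min_count times."""
    stack = [(build_suffix_trie(tokens, max_depth), ())]
    while stack:
        node, path = stack.pop()
        for tok, child in node.children.items():
            if len(child.starts) >= min_count:
                yield path + (tok,), child.starts
                stack.append((child, path + (tok,)))


def non_overlapping(starts, length):
    """Greedily keep occurrences that do not overlap the previous kept one."""
    chosen, next_free = [], 0
    for s in starts:
        if s >= next_free:
            chosen.append(s)
            next_free = s + length
    return chosen


def best_record_pattern(tokens, min_count=2, max_depth=50):
    """Pick the longest repeated pattern, breaking ties by occurrence count."""
    best = None
    for pattern, starts in repeated_patterns(tokens, min_count, max_depth):
        hits = non_overlapping(starts, len(pattern))
        if len(hits) < min_count:
            continue
        key = (len(pattern), len(hits))
        if best is None or key > best[0]:
            best = (key, pattern, hits)
    return (best[1], best[2]) if best else None
```

On a toy page such as `<ul><li><b>alice</b> hello</li><li><b>bob</b> hi</li><li><b>carol</b> hey</li></ul>`, `best_record_pattern(tokenize(page))` selects the six-token sequence from `<li>` to `</li>`, which corresponds to one post record. The trie costs quadratic time and space in the worst case, which is why a real system would use a linear-time suffix tree construction.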
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
