
Information extraction using tree automata inference technique
Original title (Chinese): 采用树自动机推理技术的信息抽取方法
Cite this article: TAN Peng-xu, ZHANG Lai-shun. Information extraction using tree automata inference technique[J]. Computer Engineering and Applications, 2010, 46(16): 153-156. DOI: 10.3778/j.issn.1002-8331.2010.16.045
Authors: TAN Peng-xu, ZHANG Lai-shun
Affiliation: Institute of Electronic Technology, PLA Information Engineering University, Zhengzhou 450004, China
Abstract: This paper proposes an information extraction method based on an improved k-contextual tree automata inference algorithm. The key idea is to convert structured or semi-structured documents into trees, and then to use an improved k-contextual tree language, called the KLH tree language, to construct an unranked tree automaton that accepts the sample trees. Whether a piece of web page information is extracted is decided by whether the automaton reaches an accepting or a rejecting state. The method makes full use of the tree structure of web documents and, by means of tree automata, combines traditional extraction approaches that rely on document structure alone with the principles of grammatical inference to obtain the extraction rules. Experimental results show that, compared with similar extraction methods, the approach shortens both the time needed to learn from samples and the time needed for extraction.
Keywords: tree automata inference algorithm; (semi-)structured documents; unranked tree automata; information extraction; KLH tree language
Received: 2008-11-19
Revised: 2009-02-18
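
The pipeline described in the abstract (document to tree, inference from sample trees, accept or reject) can be illustrated with a minimal Python sketch. This is not the authors' KLH algorithm, whose details the abstract does not give. It is a simplified local-testability check in the spirit of k-contextual inference: it records every depth-k subtree pattern seen in the samples and accepts a new tree only if all of its patterns were seen. The names `Node`, `TreeBuilder`, `parse`, `truncate`, `contexts`, `learn`, and `accepts` are hypothetical, and the markup is assumed to be well formed with explicit closing tags.

```python
from html.parser import HTMLParser


class Node:
    """A node of an unranked tree: a label and any number of children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of element names; assumes well-formed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")  # artificial root, an assumption of this sketch
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if len(self.stack) > 1 and self.stack[-1].label == tag:
            self.stack.pop()


def parse(markup):
    """Convert a markup string into a tree of Node objects."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def truncate(node, k):
    """Return the subtree rooted at node, cut to depth k, as a nested tuple."""
    if k <= 1:
        return (node.label,)
    return (node.label,) + tuple(truncate(c, k - 1) for c in node.children)


def contexts(root, k):
    """Collect the depth-k pattern rooted at every node of the tree."""
    found, todo = set(), [root]
    while todo:
        node = todo.pop()
        found.add(truncate(node, k))
        todo.extend(node.children)
    return found


def learn(samples, k):
    """Union of all depth-k patterns observed in the sample trees."""
    learned = set()
    for tree in samples:
        learned |= contexts(tree, k)
    return learned


def accepts(tree, learned, k):
    """Accept a tree only if every one of its patterns was seen in training."""
    return contexts(tree, k) <= learned


samples = [
    parse("<ul><li><b></b></li></ul>"),
    parse("<ul><li><b></b></li><li><b></b></li></ul>"),
]
learned = learn(samples, k=2)
print(accepts(parse("<ul><li><b></b></li></ul>"), learned, k=2))  # True
print(accepts(parse("<ul><li><i></i></li></ul>"), learned, k=2))  # False
```

The set of learned patterns plays the role of the automaton here: membership of every local pattern corresponds to reaching an accepting state, and any unseen pattern corresponds to rejection. A larger `k` makes the learned language stricter, while a smaller `k` generalizes more from the same samples.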
This article is indexed in the VIP (维普) and Wanfang Data (万方数据) databases, among others.