
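Web Page Main-Content Extraction Combining Node Frequency and Semantic Distance
Cite this article: MENG Jun, LIU Qiu-shui, WANG Xiu-kun. Web page main-content extraction combining node frequency and semantic distance [J]. Computer Engineering and Applications (计算机工程与应用), 2009, 45(1): 140-143.
Authors: MENG Jun  LIU Qiu-shui  WANG Xiu-kun
Affiliation: Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116023, China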
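Abstract: This paper proposes the BF-DOM tree (Block node Frequency-Document Object Module), an extended DOM tree model that carries node frequency, and uses it to extract the main content of web pages. The method builds the new model by attaching frequency and relevance attributes to certain nodes of the DOM tree, then combines the model with semantic distance to extract the main content. It rests on three observations: within a set of pages from the same source, noise nodes have very high frequency values; main content generally consists of non-link text; and links related to the main content are semantically close to the article title. Experiments on 8 websites show that the method extracts main content effectively, with recall and precision both above 96%, outperforming an extraction method based on information entropy.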
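
The first two observations in the abstract can be illustrated with a minimal Python sketch. It is not the authors' BF-DOM implementation: the `Block` type, the identification of a block by its exact text, and the `max_freq` and `max_link_ratio` thresholds are all assumptions made for illustration, and the semantic-distance step is left out because the abstract does not define it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A block-level DOM node reduced to two features (hypothetical type)."""
    text: str        # all visible text in the block
    link_text: str   # the part of `text` that sits inside <a> elements


def node_frequency(pages):
    """Fraction of pages in which each block text occurs (once per page)."""
    counts = Counter()
    for blocks in pages:
        counts.update({b.text for b in blocks})
    return {text: n / len(pages) for text, n in counts.items()}


def extract_content(blocks, freq, max_freq=0.5, max_link_ratio=0.3):
    """Keep blocks that are rare across the site and mostly non-link text.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for b in blocks:
        if not b.text:
            continue
        link_ratio = len(b.link_text) / len(b.text)
        if freq.get(b.text, 0.0) <= max_freq and link_ratio <= max_link_ratio:
            kept.append(b.text)
    return kept


# Three pages from one made-up site: shared navigation and footer, unique body.
nav = Block("Home | News | About", "Home | News | About")
footer = Block("Copyright Example Site", "")
pages = [
    [nav, Block("First article body text.", ""), footer],
    [nav, Block("Second article body text.", ""), footer],
    [nav, Block("Third article body text.", ""), footer],
]
freq = node_frequency(pages)
result = extract_content(pages[0], freq)
print(result)
```

In this toy data the navigation bar and the footer occur on every page, so their frequency is 1.0 and they are dropped as noise, while each article body occurs on one page in three and is kept.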

Keywords: information extraction  document object model tree with node frequency  node frequency  semantic distance
Received: 2008-7-24
Revised: 2008-10-16

Combing node frequency and semantic feature for webpage informative content extraction
MENG Jun, LIU Qiu-shui, WANG Xiu-kun. Combing node frequency and semantic feature for webpage informative content extraction [J]. Computer Engineering and Applications, 2009, 45(1): 140-143.
Authors: MENG Jun  LIU Qiu-shui  WANG Xiu-kun
Affiliation: Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116023, China
Abstract: This paper proposes a new model, the BF-DOM tree, which extends the Document Object Model tree by adding two properties, block node frequency and relevance, to some nodes. Combined with semantic distance, the model is used to extract the primary content of pages from the same source, based on three observations: noise nodes always have a high node frequency within a given website; primary content blocks usually contain few link words and many text words; useful links are contained in ...
Keywords: information extraction  Block node Frequency-Document Object Module(BF-DOM) tree  node frequency  semantic distance
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.