
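A Related-Link Extraction Method Based on Link Blocking
Cite this article: WANG Fang, YU Hao, TAN Hong-ye, ZHAO Tie-jun. A Related-Link Extraction Method Based on Link Blocking[J]. Computer Engineering and Applications, 2006, 42(31): 110-113
Authors: WANG Fang  YU Hao  TAN Hong-ye  ZHAO Tie-jun
Affiliation (all authors): Machine Intelligence and Translation Laboratory, School of Computer Science, Harbin Institute of Technology, Harbin 150001
Funding: Supported by Fujitsu Research and Development Center Co., Ltd.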
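Abstract: Every web page contains a large number of hyperlinks, including both related links and many noise links. This paper proposes a related-link extraction method based on link blocking. First, the web page is partitioned into blocks according to its HTML tags, and the links in each block are extracted to form link blocks. Second, related link blocks are selected from all link blocks by combining rules with statistics, using features such as the tendency of related links to appear together in a block and the sharing of words between a related link's anchor text and the title of the page that contains it. In tests, the method achieves precision above 85% and recall of about 70%, which indicates that it is effective.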

Keywords: web page segmentation  link block  related link extraction
Article ID: 1002-8331(2006)31-0110-04
Received: 2006-01-01
Revised: 2006-01-01

Relation Links Extracted Approach Based on Blocking Links
WANG Fang,YU Hao,TAN Hong-ye,ZHAO Tie-jun. Relation Links Extracted Approach Based on Blocking Links[J]. Computer Engineering and Applications, 2006, 42(31): 110-113
Authors:WANG Fang  YU Hao  TAN Hong-ye  ZHAO Tie-jun
Affiliation:Machine Intelligence and Translation Laboratory,Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract: A web page contains many hyperlinks, including both related ("relation") links and "noisy" links. This paper proposes a novel approach for extracting relation links from a page based on link blocks. The approach consists of two steps. First, the web page is partitioned into blocks according to its HTML tags, and links are extracted from each block to obtain a set of link blocks. Second, relation link blocks are identified by combining rules with statistics; for instance, relation links tend to occur together in one block, and their anchor text shares words with the title of the page in which they appear. Experimental results show that the method is effective, with precision above 85% and recall of about 70%.
Keywords:page segmentation  link block   relation link extraction
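
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of block-delimiting tags (`BLOCK_TAGS`), the thresholds `min_links` and `min_ratio`, and the function names are all hypothetical, and the simple regex tokenizer assumes space-delimited text (Chinese pages would need a word segmenter instead). The abstract says only that rules and statistics are combined; the title-overlap ratio used here is one possible reading of that.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit blocks; the paper does not list its tag set.
BLOCK_TAGS = {"table", "div", "ul", "ol", "p"}


class LinkBlockParser(HTMLParser):
    """Step 1: group links by the innermost block-level container."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = {}                 # block id -> [(href, anchor text)]
        self._stack = [("root", 0)]      # open block containers
        self._next_id = 1
        self._in_title = False
        self._href = None                # href of the <a> being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append((tag, self._next_id))
            self._next_id += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            # Close the nearest matching open block (never the root).
            for i in range(len(self._stack) - 1, 0, -1):
                if self._stack[i][0] == tag:
                    del self._stack[i:]
                    break
        elif tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            block_id = self._stack[-1][1]
            anchor = "".join(self._text).strip()
            self.blocks.setdefault(block_id, []).append((self._href, anchor))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)


def words(text):
    """Lowercased word set; assumes space-delimited text."""
    return set(re.findall(r"\w+", text.lower()))


def relevant_link_blocks(html, min_links=2, min_ratio=0.5):
    """Step 2: keep link blocks whose anchors share words with the title."""
    parser = LinkBlockParser()
    parser.feed(html)
    parser.close()
    title_words = words(parser.title)
    selected = []
    for links in parser.blocks.values():
        if len(links) < min_links:       # related links appear in groups
            continue
        hits = sum(1 for _, text in links if words(text) & title_words)
        if hits / len(links) >= min_ratio:
            selected.append(links)
    return selected
```

Calling `relevant_link_blocks(page_html)` returns a list of link blocks, each a list of `(href, anchor_text)` pairs. The sketch assumes reasonably well-formed HTML and does not filter stop words, so a real system would need more robust parsing and a tuned feature set.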
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).