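Web Page Information Extraction Based on Keyword Clustering and Node Distance
Cite this article: DENG Jian-Shuang, ZHENG Qi-Lun, PENG Hong, LIN Xu-Dong. Web page information extraction based on keyword clustering and node distance [J]. Computer Science, 2007, 34(4): 213-216.
Authors: DENG Jian-Shuang  ZHENG Qi-Lun  PENG Hong  LIN Xu-Dong
Affiliation: Artificial Intelligence Laboratory, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641
Funding: Guangdong Province Science and Technology Research Program; Guangzhou (Guangdong Province) Science and Technology Research Project
Abstract: Most web page information extraction methods target specific websites, for example methods based on site-specific extraction rules and methods based on training page samples. These methods work well on a single website, but when a new website is encountered, extraction rules must be added by hand or a new set of training pages must be provided. Moreover, when a site's template changes, the rules must be redesigned or the training pages re-entered. Such methods are hard to maintain and therefore cannot be applied to information extraction from a large number of different websites. This paper proposes a new web page information extraction method based on topic-specific keyword groups and node distance, which automatically extracts information from the pages of different websites without distinguishing among them. Experiments on pages from a large number of websites show that the method extracts the relevant information correctly and automatically regardless of where the pages come from, and it has been successfully applied in an intelligent search and mining system for e-commerce.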

Keywords: clustering  information extraction  machine learning  node distance

Web Pages Information Retrieval Based on Keywords Cluster and Node Instance
DENG Jian-Shuang,ZHENG Qi-Lun,PENG Hong,LIN Xu-Dong.Web Pages Information Retrieval Based on Keywords Cluster and Node Instance[J].Computer Science,2007,34(4):213-216.
Authors:DENG Jian-Shuang  ZHENG Qi-Lun  PENG Hong  LIN Xu-Dong
Affiliation:Department of Computer Science, The South China University of Technology, Guangzhou 510641
Abstract: Many Web information extraction methods are tied to specific Web sites, for example methods based on extraction rules and methods based on training page samples. These methods work well on one Web site but fail on others unless new rules are added or new training pages are supplied manually. Furthermore, if the template of a Web site changes, the extraction rules must be redesigned or the training pages re-entered. Such methods are hard to maintain and cannot be used to extract information from a large number of different Web sites. This paper presents a new method that automatically extracts useful information from different sites, based on the keywords of a given topic and the distance between nodes. Experimental evaluation on a large number of Web pages from different Web sites indicates that the method extracts the information correctly and automatically, regardless of which Web site the pages come from. The method has been applied successfully in an intelligent search and mining system for electronic business.
Keywords:Cluster  Information retrieval  Machine learning  Instance of node
This article is indexed by VIP, Wanfang Data, and other databases.