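一种多策略联合信息抽取方法 (A Multi-Strategy Combination Information Extraction Method)
Cite this article: 肖明军, 张巍, 邹翔, 蔡庆生. 一种多策略联合信息抽取方法 [J]. 小型微型计算机系统 (Mini-micro Systems), 2005, 26(4): 614-617.
Authors: 肖明军 (XIAO Ming-jun), 张巍 (ZHANG Wei), 邹翔 (ZOU Xiang), 蔡庆生 (CAI Qing-Sheng)
Affiliation: Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027
Funding: Supported by the National Natural Science Foundation of China (70171052)
Abstract: This paper introduces MSCIE (Multi-Strategy Combination Information Extraction), a multi-strategy combination information extraction method. MSCIE divides information extraction from tabular web pages into two parts: extraction based on analysis of web page structure features, and extraction based on pattern matching. For the first, it proposes a method that prunes redundant information from the DOM (Document Object Model) tree of a web page; for the second, it proposes an entity feature pattern discovery algorithm. The two strategies are then combined to complete the extraction task. Applied in an Internet competitive intelligence monitoring system to extract supply and demand information for many kinds of goods from a large number of websites, the method achieved high precision and recall (above 95% on average).

Keywords: machine learning; information extraction; pattern matching; multi-strategy
Article ID: 1000-1220(2005)04-0614-04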

Multi-Strategy Combination Information Extraction Method
XIAO Ming-jun, ZHANG Wei, ZOU Xiang, CAI Qing-Sheng. Multi-Strategy Combination Information Extraction Method [J]. Mini-micro Systems, 2005, 26(4): 614-617.
Authors: XIAO Ming-jun, ZHANG Wei, ZOU Xiang, CAI Qing-Sheng
Abstract: This paper introduces MSCIE (Multi-Strategy Combination Information Extraction), a multi-strategy combination information extraction method. MSCIE divides information extraction from tabular web pages into extraction based on analysis of web page structure features and extraction based on pattern matching. It proposes a method for pruning redundant information from the DOM (Document Object Model) trees of web pages and a feature pattern discovery algorithm, which are used in the two extraction strategies respectively, and it completes the extraction task by having the two strategies cooperate. Applied in an Internet-based Competitive Intelligence System, MSCIE extracted supply and demand information for many products from a large number of web sites and achieved high precision and recall (above 95% on average).
Keywords:machine learning  information extraction  pattern matching  Multi-Strategy
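
The abstract names two cooperating strategies, pruning redundant information from the DOM tree and matching feature patterns, without giving their details. The following Python sketch only illustrates the general idea and is not the MSCIE algorithm. Everything in it is an assumption made for illustration: the names `REDUNDANT_TAGS`, `KEEP_EMPTY`, `TreeBuilder`, `prune`, `rows`, `extract` and `PATTERNS`, the choice of which tags count as redundant, and the hand-written regular expressions that stand in for patterns MSCIE would discover automatically.

```python
import re
from html.parser import HTMLParser

# Assumption: tags treated as redundant for extraction (illustrative choice).
REDUNDANT_TAGS = {"script", "style", "img", "form", "input"}
VOID_TAGS = {"img", "input", "br", "hr", "meta", "link"}
# Empty table cells are kept so that columns stay aligned.
KEEP_EMPTY = {"root", "td", "th"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Strategy 1 (sketch): drop redundant tags and text-free subtrees."""
    if node.tag in REDUNDANT_TAGS:
        return False
    node.children = [c for c in node.children if prune(c)]
    return bool(node.text or node.children) or node.tag in KEEP_EMPTY


def cell_text(node):
    parts = [node.text] + [cell_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def rows(node):
    """Yield the cell texts of every <tr> that survived pruning."""
    if node.tag == "tr":
        yield [cell_text(c) for c in node.children if c.tag in ("td", "th")]
    for child in node.children:
        yield from rows(child)


# Strategy 2 (sketch): hand-written feature patterns, purely hypothetical.
PATTERNS = {
    "price": re.compile(r"^\d+(?:\.\d+)?\s*(?:yuan|USD)(?:/\w+)?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    records = []
    for cells in rows(builder.root):
        record = {}
        for cell in cells:
            for name, pattern in PATTERNS.items():
                if name not in record and pattern.match(cell):
                    record[name] = cell
        if record:  # rows with no matching cell (e.g. headers) are skipped
            record["cells"] = cells
            records.append(record)
    return records


SAMPLE = """
<html><body>
<script>var ad = 1;</script>
<div><img src="banner.gif"></div>
<table>
<tr><th>Product</th><th>Price</th><th>Date</th></tr>
<tr><td>Steel pipe</td><td>3200 yuan/ton</td><td>2004-11-02</td></tr>
<tr><td>Copper wire</td><td>45.5 USD</td><td>2004-11-05</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for rec in extract(SAMPLE):
        print(rec)
```

In this sketch the pruning step discards the script and the image-only `div` before any row is read, and the pattern step keeps only rows in which at least one cell matches a feature pattern, so the header row is skipped. A real system would presumably learn both the redundancy criteria and the patterns from sample pages instead of hard-coding them.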