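Application research on chemical substance information extraction based on Web mining
Cite this article: FENG Shuo, LI Shu-qin, YANG Hui-jun. Application research on chemical substance information extraction based on Web mining[J]. Computer Engineering and Design, 2012, 33(8): 3040-3046.
Authors: FENG Shuo  LI Shu-qin  YANG Hui-jun
Affiliation: College of Information Engineering, Northwest Agriculture and Forestry University, Yangling, Shaanxi 712100
Funding: Special Fund for Scientific Research in the Public Welfare Sector (Environmental Protection) (200909086)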
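Abstract: To address the acquisition of chemical substance information from multi-source websites and the updating and querying of the database, web crawler technology and a wrapper method are used to extract the data. Using custom XML files, methods for task partitioning, dynamic update checking, and failure retry are proposed, achieving continuous, real-time extraction of chemical substance information from dynamic source websites, together with exception handling and monitoring. The extracted data are preprocessed with regular expressions and a sorting algorithm and used to build a comprehensive and accurate chemical environmental safety database, which ultimately supports updating and querying the original data and ensures a degree of reliability, availability, extensibility, and maintainability.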

Keywords: Web information extraction  task partitioning  retry mechanism  continuous extraction  data preprocessing
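
The abstract mentions preprocessing the extracted data with regular expressions and a sorting algorithm but gives no details. The following is only a minimal sketch of what such a step might look like. It assumes, purely for illustration, that each scraped record is a free-text string containing a CAS Registry Number; the names `clean_record` and `preprocess` and the record layout are hypothetical and are not taken from the paper.

```python
import re

# CAS Registry Numbers have the form: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")
WHITESPACE = re.compile(r"\s+")


def clean_record(raw):
    """Normalize one scraped string into {"cas", "name"}, or None if no CAS number is found."""
    text = WHITESPACE.sub(" ", raw).strip()
    match = CAS_PATTERN.search(text)
    if match is None:
        return None
    cas = match.group(0)
    # Whatever remains after removing the CAS number is treated as the name (an assumption).
    name = WHITESPACE.sub(" ", CAS_PATTERN.sub("", text)).strip(" ,;:")
    return {"cas": cas, "name": name}


def preprocess(raw_records):
    """Clean, de-duplicate by CAS number (first occurrence wins), and sort numerically."""
    unique = {}
    for raw in raw_records:
        record = clean_record(raw)
        if record is not None:
            unique.setdefault(record["cas"], record)
    return sorted(
        unique.values(),
        key=lambda r: tuple(int(part) for part in r["cas"].split("-")),
    )


if __name__ == "__main__":
    sample = ["  Benzene \n 71-43-2 ", "Toluene; 108-88-3", "no identifier here", "benzene 71-43-2"]
    for row in preprocess(sample):
        print(row)
```

Sorting on the numeric parts rather than on the raw string keeps `71-43-2` ahead of `108-88-3`, which a plain lexicographic sort would reverse.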

Application research on chemical information extraction based on web data mining
FENG Shuo, LI Shu-qin, YANG Hui-jun. Application research on chemical information extraction based on web data mining[J]. Computer Engineering and Design, 2012, 33(8): 3040-3046.
Authors: FENG Shuo  LI Shu-qin  YANG Hui-jun
Affiliation: College of Information Engineering, Northwest Agriculture and Forestry University, Yangling 712100, China
Abstract: To address the acquisition of chemical substance information from multi-source websites, together with database updating and querying, web crawler technology and a wrapper method are used to extract data. Methods for task partitioning, dynamic update checking, and failure retry are proposed, configured through a user-defined XML file, to implement continuous, real-time extraction of chemical information from the source websites, along with exception handling and monitoring. The extracted data are then preprocessed with regular expressions and a sorting algorithm and used to build a comprehensive and accurate database of the environmental safety of chemicals, which finally supports updating and querying the original database. A certain degree of reliability, availability, extensibility, and maintainability is ensured.
Keywords: web information extraction  task division  retry strategy  continuous extraction  data pretreatment
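
Task partitioning and a failure retry mechanism are named in the abstract, but their design is not described on this page. The sketch below shows one common way the two ideas can be combined, under stated assumptions: the work is a list of substance identifiers, a caller-supplied `fetch` function downloads one item, and any exception counts as a failure. The names `split_tasks`, `fetch_with_retry`, and `run_batches` are hypothetical, and the paper's XML-driven configuration is not reproduced here.

```python
import time


def split_tasks(ids, batch_size):
    """Partition a sequence of identifiers into fixed-size batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(ids[i:i + batch_size]) for i in range(0, len(ids), batch_size)]


def fetch_with_retry(fetch, key, max_attempts=3, base_delay=0.0):
    """Call fetch(key); on any exception, retry until max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(key)
        except Exception as exc:  # a real crawler would catch narrower error types
            last_error = exc
            if attempt < max_attempts:
                time.sleep(base_delay * attempt)  # linearly growing pause between attempts
    raise RuntimeError(f"giving up on {key!r} after {max_attempts} attempts") from last_error


def run_batches(ids, fetch, batch_size=2):
    """Process every batch; collect results and the keys that failed all retries."""
    results, failed = {}, []
    for batch in split_tasks(ids, batch_size):
        for key in batch:
            try:
                results[key] = fetch_with_retry(fetch, key)
            except RuntimeError:
                failed.append(key)  # recorded so a later run can pick it up again
    return results, failed
```

Keeping the failed keys in a separate list, rather than aborting the run, is one way to let a long extraction continue while still leaving a record for monitoring.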
This article is indexed in CNKI, Wanfang Data, and other databases.