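Research on Web Data Extraction with Regular Expressions
Cite this article: LIU Songye. Research on Web Data Extraction with Regular Expressions[J]. Computer Programming Skills & Maintenance (电脑编程技巧与维护), 2008, 0(15): 89-91
Author: LIU Songye
Affiliation: Information School, East China Normal University, Shanghai 200062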
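Abstract: The Internet is increasingly becoming an important source of information, and how to retrieve and process Web data so that users can make better use of the data resources on the Internet has become a new research focus. This paper discusses a semi-automatic data extraction algorithm that uses an information-slot extraction algorithm based on extended regular expressions and an event segmentation algorithm based on web page features. It also describes an information extraction system that uses these algorithms, and presents the system's architecture and implementation details. The system can be used in a real Web environment to improve the efficiency of storing and using information, and to some extent eases the difficulty of obtaining and using information on the Internet.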

Keywords: data extraction  algorithms  regular expressions  semi-structured data

Study on Extraction Approach of Web Information Based on Regular Expression
LIU Songye. Study on Extraction Approach of Web Information Based on Regular Expression[J]. Computer Programming Skills & Maintenance, 2008, 0(15): 89-91
Authors: LIU Songye
Affiliation: LIU Songye (Information School, East China Normal University, Shanghai 200062)
Abstract: The Internet is becoming a very important information resource. How to retrieve and process Web information so that users can use the resources on the Internet more effectively and efficiently has become an active area of academic research. The paper describes a semi-automatic information extraction algorithm, which uses an extraction algorithm based on extended regular expressions and an event-splitting algorithm based on web page features. The algorithm is used in a web recruitment information extraction project, and good performance is obtained. Relevant experiments are performed to show the advantages and disadvantages of these algorithms.
Keywords: Web information extraction  algorithms  regular expressions  semi-structured data
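
The abstract names two steps without showing them: splitting a page into events using page features, then filling information slots with regular expressions. The following is a minimal Python sketch of that general idea, not the paper's algorithm. The recruitment markup, the `EVENT_RE` and `SLOT_RES` patterns, and the `extract_events` helper are all hypothetical and invented for illustration, and the sketch uses ordinary Python regular expressions rather than the extended form the paper refers to.

```python
import re

# Hypothetical recruitment listing; the markup is invented for illustration.
PAGE = """
<div class="job"><h3>Java Developer</h3><span>Company: Acme Ltd</span><span>City: Shanghai</span></div>
<div class="job"><h3>Test Engineer</h3><span>Company: Example Inc</span><span>City: Beijing</span></div>
"""

# Event segmentation (assumption): each posting sits in its own
# <div class="job"> block with no nested <div> inside it.
EVENT_RE = re.compile(r'<div class="job">(.*?)</div>', re.DOTALL)

# Slot extraction (assumption): one pattern per slot, each with one capture group.
SLOT_RES = {
    "title": re.compile(r"<h3>(.*?)</h3>"),
    "company": re.compile(r"Company:\s*([^<]+)"),
    "city": re.compile(r"City:\s*([^<]+)"),
}


def extract_events(page):
    """Split the page into events, then fill each slot from the event text."""
    records = []
    for event in EVENT_RE.findall(page):
        record = {}
        for slot, pattern in SLOT_RES.items():
            match = pattern.search(event)
            # A slot that the pattern cannot find is left empty (None).
            record[slot] = match.group(1).strip() if match else None
        records.append(record)
    return records


records = extract_events(PAGE)
for record in records:
    print(record)
```

Segmenting first keeps each slot pattern local to a single event, so a company name from one posting cannot be paired with the title of another. The approach is semi-automatic in the sense that a person still has to write the patterns for each site.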
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).