Information extraction from massive Web pages based on node property and text content
Authors: Hai-yan WANG, Pan CAO
Affiliation: 1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 210003, China
Abstract: To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node property and text content was put forward. Each Web page was converted into a document object model (DOM) tree, and a pruning and fusion algorithm was introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to realize parallel information extraction from massive Web pages. Simulation and experimental results demonstrate that the proposed method achieves both better performance and higher scalability than other methods.
Keywords: Web information; extraction; MapReduce; DOM tree
Source: Journal on Communications (《通信学报》); the original abstract and full text are available from the journal.
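The pipeline outlined in the abstract (build a DOM tree, prune it, score nodes by a density property, process pages in parallel) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, since the abstract does not give the definitions: the pruned tag list, the density formula (text characters divided by tag count in a subtree), the choice of the single highest-density node, and all names such as `TreeBuilder`, `text_density`, and `map_pages` are hypothetical. The vision property and the fusion step are not modeled, and a plain `dict` comprehension stands in for the map phase of a real MapReduce job. The parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumption: tags pruned outright (the paper's pruning rules are not given).
PRUNED_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are not pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, skipping pruned subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in PRUNED_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text += data.strip()


def iter_nodes(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_density(node):
    """Assumed density: text characters per tag in the node's subtree."""
    nodes = list(iter_nodes(node))
    return sum(len(n.text) for n in nodes) / len(nodes)


def extract(html):
    """Return the text of the highest-density subtree of one page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_nodes(builder.root), key=text_density)
    return " ".join(n.text for n in iter_nodes(best) if n.text)


def map_pages(pages):
    """Stand-in for the map phase: each page is processed independently."""
    return {url: extract(html) for url, html in pages.items()}
```

Because `extract` depends only on its own page, the per-page work has no shared state, which is what makes this kind of extraction a natural fit for a map phase; a real deployment would distribute the pages across workers rather than loop over them in one process.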