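A Deep Web Text Acquisition Method Based on the WSFT Model
Cite this article: YANG Guanzhong, LI Hongxuan. A Deep Web Text Acquisition Method Based on the WSFT Model[J]. Computer Engineering and Applications, 2017, 53(18): 236-242.
Authors: YANG Guanzhong  LI Hongxuan
Affiliation: School of Information Science and Engineering, Hunan University, Changsha 410082
Abstract: Ajax technology is widely used in the development of Deep Web sites. To address the multi-state nature of Ajax pages and the strong correlation between their states, this paper proposes a method for constructing a WSFT (Weighted State Fusion Tree) model to pre-process the text information of Ajax pages. A text feature tree is introduced as the state fingerprint for state capture, which optimizes current Ajax page data collection methods. The StatusRank method is then used to compute state transition weights in order to analyze state transition information, and finally the WSFT is generated. Experiments show that the method effectively obtains the multi-state text information of Ajax pages and helps subsequent Web mining extract important text content.

Keywords: Ajax crawler  weighted state fusion tree  text mining  text feature tree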

Approach based on WSFT for crawling deep web
YANG Guanzhong,LI Hongxuan.Approach based on WSFT for crawling deep web[J].Computer Engineering and Applications,2017,53(18):236-242.
Authors:YANG Guanzhong  LI Hongxuan
Affiliation:School of Information Science and Engineering, Hunan University, Changsha 410082, China
Abstract: Ajax technology has been widely applied in deep web application development. Ajax pages have multiple states that are strongly correlated, so this paper constructs a Weighted State Fusion Tree (WSFT) model to pre-process their text information. First, the current approach to Ajax page data collection is optimized by using a text feature tree as the state fingerprint when traversing the multiple states. Second, the transition weight of each state of the Ajax page is calculated with the StatusRank method in order to analyze state transition information. Finally, a WSFT is generated. The experimental results show that the proposed method can effectively obtain the text information of multi-state Ajax pages and helps subsequent web mining extract important text.
Keywords:Ajax crawler  weighted state fusion tree  text mining  text feature tree  
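
The two ideas named in the abstract, a text-based state fingerprint and a weight for each state derived from the transition graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the text feature tree or StatusRank in detail, so the sketch assumes that a fingerprint is a hash over the text-bearing nodes of a simplified DOM, and it substitutes a plain PageRank-style iteration for StatusRank. The names `Node`, `text_feature_tree`, `fingerprint`, and `transition_weights` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM node (hypothetical): tag, own text, child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def text_feature_tree(node):
    """Keep only nodes that carry text or have text-bearing descendants.

    Returns a nested tuple (tag, normalized_text, children), or None
    when the whole subtree contains no text.
    """
    kept = tuple(
        t for t in (text_feature_tree(c) for c in node.children) if t is not None
    )
    text = " ".join(node.text.split())  # collapse runs of whitespace
    if not text and not kept:
        return None
    return (node.tag, text, kept)


def fingerprint(root):
    """Hash the text feature tree; equal text trees give equal fingerprints."""
    return hashlib.sha256(repr(text_feature_tree(root)).encode("utf-8")).hexdigest()


def transition_weights(edges, damping=0.85, iterations=50):
    """PageRank-style score per state (an assumed stand-in for StatusRank).

    `edges` is a non-empty list of (source_state, target_state) pairs.
    """
    states = sorted({s for edge in edges for s in edge})
    out_links = {s: [dst for src, dst in edges if src == s] for s in states}
    rank = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / len(states) for s in states}
        for src in states:
            # A state with no outgoing transition spreads its score evenly.
            targets = out_links[src] or states
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

In a crawler built along these lines, a newly reached DOM would be fingerprinted and discarded if the fingerprint had already been seen, so that states differing only in markup without text are not stored twice. The resulting scores could then rank states when their text is merged into a single tree.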