
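Improving the Context Graph Algorithm with Feature Selection Based on Word-Frequency Differences
Cite this article: ZHANG Yong, WU Chongzheng. Improving the Context Graph algorithm with feature selection based on word-frequency differences[J]. Computer Engineering and Applications, 2014, 50(10): 141-146
Authors: ZHANG Yong  WU Chongzheng
Affiliation: School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050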
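Abstract: To address the low efficiency of traditional focused crawlers, this paper analyzes the heuristic crawler search algorithm Context Graph and proposes an improved Context Graph search strategy. The strategy improves the original algorithm with a feature selection method based on word-frequency differences and a modified TF-IDF formula. It accounts for the influence that text from different parts of a web page has on feature selection, as well as the inter-class and intra-class weights of feature words, in order to improve the quality of feature selection and evaluation. Experimental results show that the improved strategy is more efficient than the established traditional methods used for comparison.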

Keywords: focused crawler  Context Graph model  search strategy  feature selection  TF-IDF

Improved context graph algorithm by using feature selection based on word frequency differentia
ZHANG Yong, WU Chongzheng. Improved context graph algorithm by using feature selection based on word frequency differentia[J]. Computer Engineering and Applications, 2014, 50(10): 141-146
Authors: ZHANG Yong  WU Chongzheng
Affiliation: School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
Abstract: To address the low efficiency of traditional focused crawlers, the heuristic web crawler search algorithm Context Graph is analyzed. The Context Graph method, however, has deficiencies. An optimization strategy is proposed that adopts an improved TF-IDF formula and a feature selection method based on word-frequency differences, taking into account the differing importance of text in different parts of a web page. A new term-weighting method for text categorization is also described, which considers the weight of feature words both among classes and within a class. Experimental results indicate that, compared with the other algorithms tested, this strategy crawls topic pages more efficiently.
Keywords: focused crawler  Context Graph  search strategy  feature selection  TF-IDF
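
The abstract names two ingredients without giving their formulas: feature selection by word-frequency difference, and a TF-IDF variant that treats text from different parts of a page differently. The sketch below is only one plausible reading, not the authors' method. The zone names and weights in `ZONE_WEIGHTS`, the scoring rule in `frequency_difference`, and the smoothed IDF in `tf_idf` are all assumptions made for illustration, and the inter-class and intra-class weighting mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter

# Hypothetical zone weights (assumed values, not taken from the paper).
ZONE_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_tf(page):
    """Return zone-weighted term counts for a page.

    `page` maps a zone name (e.g. "title") to a list of tokens.
    """
    tf = Counter()
    for zone, tokens in page.items():
        weight = ZONE_WEIGHTS.get(zone, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def relative_frequencies(pages):
    """Pool weighted counts over pages and normalize them to sum to 1."""
    total = Counter()
    for page in pages:
        total.update(weighted_tf(page))
    norm = sum(total.values()) or 1.0
    return {term: count / norm for term, count in total.items()}


def frequency_difference(topic_pages, other_pages):
    """Score each term by its relative frequency in topic pages
    minus its relative frequency in the other pages (assumed rule)."""
    in_topic = relative_frequencies(topic_pages)
    in_other = relative_frequencies(other_pages)
    terms = set(in_topic) | set(in_other)
    return {t: in_topic.get(t, 0.0) - in_other.get(t, 0.0) for t in terms}


def select_features(topic_pages, other_pages, k):
    """Keep the k terms with the largest frequency difference."""
    scores = frequency_difference(topic_pages, other_pages)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def tf_idf(page, all_pages, features):
    """Zone-weighted TF times a smoothed IDF, for the selected features."""
    tf = weighted_tf(page)
    n = len(all_pages)
    result = {}
    for term in features:
        df = sum(1 for p in all_pages
                 if any(term in tokens for tokens in p.values()))
        result[term] = tf[term] * math.log((n + 1) / (df + 1))
    return result


topic = [{"title": ["crawler"], "body": ["crawler", "graph"]}]
other = [{"title": ["news"], "body": ["graph", "news"]}]
features = select_features(topic, other, k=2)
weights = tf_idf(topic[0], topic + other, features)
```

In this toy input, a term that is frequent in topic pages and absent elsewhere ranks first, while a term that appears equally in both sets gets a difference of zero and an IDF of zero, so it carries no weight when pages are scored.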
This article is indexed in CNKI, VIP (维普), and other databases.