
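Web Page Segmentation Method Based on Title Machine Learning
Cite this article: LI Jin-sheng, LE Hui-xiao, TONG Ming-wen. Web page segmentation method based on title machine learning [J]. Computer Science (计算机科学), 2018, 45(Z6): 583-587.
Authors: LI Jin-sheng, LE Hui-xiao, TONG Ming-wen
Affiliations: Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033; School of Education Information Technology, Central China Normal University, Wuhan 430079; School of Education Information Technology, Central China Normal University, Wuhan 430079
Funding: This work was supported by the Humanities and Social Sciences Fund of the Ministry of Education.
Abstract: Existing web page segmentation methods are all built on the Document Object Model (DOM) and are difficult to implement. To address this, a new method is proposed that segments web pages using a string data model. The method learns the features of web page titles through machine learning and uses the titles to segment the page. First, title features are learned from the page's line block distribution function and its title tags. Next, the page is split into content blocks at the titles. Finally, the content blocks are merged according to block depth, which completes the segmentation. Theoretical analysis and experimental results show that the algorithms in the method have O(n) time and space complexity, and that the method segments university portal pages, blog posts, and resource website pages well. It can be used in a variety of web page information management applications and has good application prospects.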

Keywords: Web page segmentation; Title; Line block distribution function; Block depth; Machine learning

Novel Method of Web Page Segmentation Based on Title Machine Learning
LI Jin-sheng, LE Hui-xiao and TONG Ming-wen. Novel Method of Web Page Segmentation Based on Title Machine Learning [J]. Computer Science, 2018, 45(Z6): 583-587.
Authors: LI Jin-sheng, LE Hui-xiao and TONG Ming-wen
Affiliation: Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China; School of Education Information Technology, Central China Normal University, Wuhan 430079, China; School of Education Information Technology, Central China Normal University, Wuhan 430079, China
Abstract: Existing web page segmentation methods are based on the document object model (DOM) and are difficult to implement. To address this, a novel method based on a string model is proposed. The features of web page titles are learned by machine learning, and the page is segmented using the titles found. First, the titles in a web page are identified using the line block distribution function and title tags. Second, the page is partitioned into content blocks at the titles. Finally, the content blocks are merged according to block depth. Theoretical analysis and experiments show that the algorithms in the method have O(n) time and space complexity, and that the method is suitable for web pages of university portals, blogs, and resource websites. The method is useful for many applications in web page information management and has good application prospects.
Keywords: Web page segmentation; Title; Line block distribution function; Block depth; Machine learning
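
The abstract names the pipeline stages but does not define them, so the following Python sketch is only an illustration of the string-based idea, not the authors' algorithm. It assumes a common definition of the line block distribution function (for each line, the total length of tag-free text in that line and the next k-1 lines), and it swaps in a fixed heuristic (any line opening an h1 to h6 tag counts as a title) for the learned title features. The function names, the parameter k, and the regular expressions are all hypothetical. Merging by block depth is left out because the abstract does not say how block depth is computed.

```python
import re

# Assumption: tags are removed with a simple regex; the page is treated as a
# plain string split into lines, with no DOM tree built at any point.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a "title" is any line that opens an h1..h6 tag. The paper learns
# title features instead; this fixed rule is only a stand-in.
HEADING_RE = re.compile(r"<h[1-6]\b", re.IGNORECASE)


def line_block_lengths(html: str, k: int = 3) -> list[int]:
    """Return, for each line i, the tag-free text length of lines i..i+k-1.

    Whitespace is not counted. For a fixed k the work is linear in the input.
    """
    text_lens = [
        len(re.sub(r"\s+", "", TAG_RE.sub("", line)))
        for line in html.splitlines()
    ]
    return [sum(text_lens[i:i + k]) for i in range(len(text_lens))]


def split_by_titles(html: str) -> list[list[str]]:
    """Split the page into content blocks, starting a new block at each title line."""
    blocks: list[list[str]] = []
    for line in html.splitlines():
        text = TAG_RE.sub("", line).strip()
        if not text:
            continue  # skip lines that contain only markup or whitespace
        if HEADING_RE.search(line) or not blocks:
            blocks.append([text])    # a title line opens a new block
        else:
            blocks[-1].append(text)  # other text joins the current block
    return blocks
```

Each function makes a single pass over the lines of the page, which is in the spirit of the O(n) bound stated in the abstract, though the sketch makes no claim about matching the paper's actual complexity analysis.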
