
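Research and implementation of Uyghur web content extraction system
Cite this article: CAI Li, SHAN Yan, XUE Hua-jian, SU Guo-ping. Research and implementation of Uyghur web content extraction system[J]. Computer Engineering and Design, 2012, 33(2): 551-555
Authors: CAI Li, SHAN Yan, XUE Hua-jian, SU Guo-ping
Affiliations: 1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang; Graduate University of the Chinese Academy of Sciences, Beijing 100049
2. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang
3. Economic and Information Commission of Xinjiang Uygur Autonomous Region, Urumqi 830011, Xinjiang
Funding: High-Tech Fund Project of the Western Action Plan, Chinese Academy of Sciences
Abstract: With the goal of building a large-scale Uyghur corpus, this paper surveys the main web page content extraction techniques and proposes an extraction method based on text sentence-length features. The method defines a series of filtering and replacement rules to preprocess the page source code, then uses sentence-length features to decide whether each text segment belongs to the main content of the page. The whole process does not depend on a DOM tree structure, which avoids the performance drawbacks of DOM-tree-based content extraction. Experimental results show that the method achieves high accuracy and good generality when extracting the main content of the various types of Uyghur web pages.

Keywords: Uyghur; web content extraction; corpus; text sentence-length feature; web text mining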

Research and implementation of Uyghur web content extraction system
CAI Li, SHAN Yan, XUE Hua-jian, SU Guo-ping. Research and implementation of Uyghur web content extraction system[J]. Computer Engineering and Design, 2012, 33(2): 551-555
Authors:CAI Li    SHAN Yan    XUE Hua-jian    SU Guo-ping
Affiliation: 1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. Graduate University, Chinese Academy of Sciences, Beijing 100049, China; 3. Economic and Information Commission of Xinjiang Uygur Autonomous Region, Urumqi 830011, China
Abstract: To support the construction of a large-scale Uyghur corpus, this paper surveys existing web content extraction methods and presents a method based on sentence-length features. First, the web page source code is preprocessed with a series of filtering and replacement rules. Each text segment is then judged to be main content or not according to its sentence-length characteristics. The whole process analyzes the page source code directly rather than depending on a DOM tree structure, and so overcomes the performance shortcomings of DOM-tree-based content extraction methods. Experimental results show that the method has high reliability and good versatility in content extraction for various types of web pages.
Keywords:Uyghur  web content extraction  corpus  text sentence length characteristic  web text mining
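
The two stages named in the abstract (rule-based preprocessing of the page source, then a sentence-length test on each text segment) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the actual rules, the punctuation set, or any threshold, so the regular expressions, the function names `preprocess`, `average_sentence_length` and `extract_content`, and the 40-character cutoff below are all assumptions chosen for illustration.

```python
import html
import re

# All rules and the threshold below are hypothetical; the abstract does not specify them.
# Filtering rule: drop script/style/noscript blocks and HTML comments entirely.
DROP_BLOCKS = re.compile(
    r"<(script|style|noscript)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I
)
# Replacement rule: turn block-level tags into segment boundaries (newlines).
BLOCK_TAGS = re.compile(r"</?(p|div|br|li|tr|td|h[1-6]|table|ul|ol)\b[^>]*>", re.I)
# Filtering rule: strip every remaining (inline) tag.
ANY_TAG = re.compile(r"<[^>]+>")
# Sentence terminators, including the Arabic question mark and Arabic full stop.
SENTENCE_END = re.compile(r"[.!?\u061F\u06D4]+")
MIN_AVG_SENTENCE_LEN = 40  # characters; an assumed cutoff, not taken from the paper


def preprocess(source: str) -> list[str]:
    """Apply the filter/replace rules to raw page source and return text segments."""
    text = DROP_BLOCKS.sub(" ", source)
    text = re.sub(r"\s+", " ", text)   # collapse whitespace, including source newlines
    text = BLOCK_TAGS.sub("\n", text)  # block-level tags become segment boundaries
    text = ANY_TAG.sub("", text)
    text = html.unescape(text)
    segments = (segment.strip() for segment in text.split("\n"))
    return [segment for segment in segments if segment]


def average_sentence_length(segment: str) -> float:
    """Mean length, in characters, of the sentences found in one segment."""
    sentences = [s.strip() for s in SENTENCE_END.split(segment) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


def extract_content(source: str, min_len: float = MIN_AVG_SENTENCE_LEN) -> list[str]:
    """Keep the segments whose average sentence length reaches the cutoff."""
    return [
        segment
        for segment in preprocess(source)
        if average_sentence_length(segment) >= min_len
    ]
```

Because the sketch works on the source string with regular expressions, no DOM tree is built, which matches the design choice described in the abstract. Navigation bars and footers tend to consist of short fragments and fall below the cutoff, while body paragraphs made of full sentences pass it. A real system would need a threshold tuned on Uyghur pages.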
This article is indexed in CNKI, Wanfang Data, and other databases.